Domain adaptation remains a crucial approach for tailoring mainstream LLMs to practical use in clinical environments, even after the surge of ChatGPT (https://chat.openai.com/), a powerful LLM service that has revolutionized the way we interact with text and language through its astonishing ability to generate sentences. While these general-purpose models are powerful in zero-shot inference on unseen tasks, fine-tuned models may have the potential to outperform them in domain-specific tasks. Several works on domain adaptation within the medical field exist for powerful English-centric LLMs as well [1-4], but research in this direction is largely lacking in Japanese, highlighting the need to pioneer studies in non-English contexts. The drive to develop large-scale medical LLMs in one’s native language is not only prevalent in Japan but is also starting to become mainstream in other non-English-speaking countries. In Japan, the sole precedent in the area of Japanese medical language models is the work of Sugimoto et al. [5], who developed a Japanese medical language model named JMedRoBERTa based on RoBERTa, a BERT-based model [6]. This study is the first exploration along this line using large-scale GPT models with a focus on text generation.

Moreover, ChatGPT utilization is impeded in clinical practice due to concerns related to data privacy and security. The potential risks associated with data breaches or misuse of confidential patient information underscore the need for robust security measures and ethical considerations, further complicating its seamless integration into clinical settings. Hence, we need to consider domain adaptation using other LLMs for incorporating medical knowledge.

Recently, several parameter-efficient fine-tuning methods have been proposed, including low-rank adaptation (LoRA) and its quantized version (QLoRA) [7,8], in which only a limited set of parameters is selected as the target of fine-tuning. Performed along with instruction-tuning, LoRA has demonstrated some success in acquiring conversational abilities and improving domain-specific performance on tasks such as financial question-answering [9,10]. That being said, the abilities and limitations of LoRA-based instruction-tuning in domain adaptation have not been clarified. The recently proposed “Superficial Alignment Hypothesis” [11] conjectures that fine-tuning does not contribute significantly to the acquisition of knowledge, but this topic remains controversial.
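
As an illustration of the parameter-efficient setup described above, the following is a minimal sketch of attaching LoRA adapters to a causal language model with the Hugging Face peft library before instruction-tuning; the base checkpoint name, rank, and target modules are assumptions for illustration, not the configuration used in this study.

    # Minimal sketch of LoRA-based fine-tuning with the Hugging Face "peft"
    # library. Checkpoint name and hyperparameters are illustrative
    # assumptions, not the configuration used in this study.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, TaskType, get_peft_model

    base_model = "example-org/japanese-llm-7b"  # hypothetical base checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

    # LoRA freezes the pretrained weights and injects small trainable
    # low-rank matrices into selected weight matrices, so only a tiny
    # fraction of parameters is updated during fine-tuning.
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                                  # rank of the low-rank update (assumed)
        lora_alpha=16,                        # scaling factor (assumed)
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # typically well under 1% trainable

    # Instruction-tuning then proceeds with a standard causal-LM loss over
    # (instruction, response) pairs, e.g., via transformers.Trainer,
    # updating only the adapter weights.

QLoRA follows the same pattern but quantizes the frozen base weights (e.g., to 4-bit) to further reduce memory requirements.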

Therefore, we aim to investigate whether LoRA-based instruction tuning can be effective in acquiring domain-specific knowledge, especially medical knowledge.

The primary research questions guiding our study are as follows:
i. How and how much can domain knowledge be incorporated into LLMs by LoRA-based fine-tuning?
ii. Do larger English-centric LLMs outperform smaller Japanese-centric LLMs?
iii. Does the amount of fine-tuning hold significance?

To answer these questions, we conducted a comprehensive comparison between different LLMs fine-tuned with our own Japanese medical dataset, evaluating each model through a medical question-answering approach. This enables us to clarify the strengths and limitations of incorporating domain-specific knowledge by LoRA, setting the stage for constructing enhanced versions of various domain-specific Japanese LLMs.

2. Related works

In recent years, there has been active research on constructing pretrained language models specialized for the medical domain. Before the emergence of GPT-3 [12] in 2020 and ChatGPT in 2022, the prevailing trend in research involved building BERT-based language models [6] and evaluating them on classification tasks. In English-speaking regions, models such as BioBERT [13], Med-BERT [14], ClinicalBERT [15], and PubMedBERT [16] have been proposed, leveraging medical literature databases such as PubMed and clinical record databases such as MIMIC-III [17]. In Japan as well, UTH-BERT [18] and JMedRoBERTa [5] have become available online. UTH-BERT [18] is the first medical pretrained language model in Japanese, pretrained on approximately 120 million lines of clinical text. JMedRoBERTa [5], on the other hand, utilizes 11 million lines of journal articles in medicine, with the goal of accumulating information across a diverse range of content, from basic research to case studies.

In the wake of the emergence of GPT-3 [12] and ChatGPT, the focus of research shifted toward LLMs leveraging the Transformer architecture [19], accompanied by a steady increase in the parameter sizes of models. The primary tasks of interest in research also transitioned from classification to medical text generation and medical question-answering. Among English-centric models, BioMedLM (formerly known as PubMedGPT) [20], BioGPT [21], and BioMedGPT [22] have been proposed, harnessing the strength of the latest general-purpose LLMs. However, the currently available models have limited sizes: BioMedLM [20] has 2.7 billion parameters, BioGPT [21] is based on the GPT-2 [23] architecture with 1.3 billion parameters, and BioMedGPT [22] comprises 10 billion parameters. On the other hand, Google has pursued its own path in developing medical models, including Med-PaLM [1] and Med-PaLM 2 [2] with 540 billion and 340 billion parameters, respectively; nonetheless, these models are not accessible to the public. To the best of our