Artificial Intelligence in Health Medical instruction-tuning for Japanese LLMs
Table 1. LoRA/QLoRA parameters

                           OpenCALM-7B          Llama2-70B
Fine-tuning method         LoRA                 QLoRA
Learning rate              5e-5                 2e-4
Input length               512                  512
Target max length          512                  512
Batch size                 8                    8
Fine-tuning steps          1k, 3k, 10k          0.9k, 3k
r of (Q)LoRA               8                    64
α of (Q)LoRA               32                   16
Dropout rate of (Q)LoRA    0.05                 0.1
Target parameter           Query, key, value    All linear layers

as “instruction”: Question content, “output”: Answer content, and do not include line breaks. Repeat this process 15 times and list one data pair per line.
### Input: {input_text}

The number of epochs and steps was set to align with the overall computational time in each experiment. Using a larger model such as Llama2-70B increases the GPU memory usage per sample. To offset this, memory usage can be reduced by lowering the floating-point precision or by using gradient accumulation. In this study, we adopted 4-bit QLoRA on Llama2-70B. Since 4 bits is optimal in terms of the relationship between floating-point precision and model performance, it is not desirable to reduce the floating-point precision any further.²⁷ To run the experiments with less GPU memory, we also tried gradient accumulation, which splits each batch into smaller mini-batches; for example, a batch size of 8 is computed in two passes with a mini-batch size of 4. This approach makes it possible to work with larger models while reducing the computing resources required.

3.3. Evaluation by medical question-answering tasks

The state-of-the-art performance of English medical LLMs is typically evaluated using benchmark datasets such as MedQA (United States Medical Licensing Examination, USMLE),²⁸ MedMCQA,²⁹ and PubMedQA.³⁰ However, the availability of Japanese-curated medical task datasets is significantly limited, with IgakuQA (Japanese medical licensing exams) being the only one available at present.³¹ Hence, in addition to IgakuQA, we prepared a new Q&A dataset, JJSIMQA, to assess the performance of each model in the medical domain. JJSIMQA is our own dataset comprising 5-choice questions included in JJSIM as appendices. Here are some samples from the IgakuQA and JJSIMQA datasets:

An example from IgakuQA (originally in Japanese)

“problem_id”: “116A1”,
“problem_text”: “Which of the following is incorrect regarding hypertension caused by obstructive sleep apnea?,”
“choices”: {
    “a”: “It often leads to nocturnal hypertension.”,
    “b”: “Weight reduction is recommended for obese patients.”,
    “c”: “Alpha-blockers are the first-line choice of medication.”,
    “d”: “Morning hypertension is frequently observed in home blood pressure measurements.”,
    “e”: “Continuous positive airway pressure (CPAP) therapy is expected to lower blood pressure.”},
“text_only”: True,
“answer”: [“c”]

An example from JJSIMQA, 5-choice questions in JJSIM (originally in Japanese)

“problem_text”: “Which of the following is incorrect about recent cases of hepatitis B in Japan? Choose one.”,
“choices”: {
    “a”: “The HBs antigen positivity rate has significantly decreased due to the initiation of mother-to-child infection prevention programs.”,
    “b”: “HBV (hepatitis B virus) genotype Ae can become a carrier through horizontal transmission in adults.”,
    “c”: “In Japan, routine HBV vaccination began in October 2016.”,
    “d”: “HBV genotype C is more prevalent in the Tohoku and Miyako-Yaeyama regions.”,
    “e”: “Horizontal transmission of HBV during childhood is thought to be partly attributed to father-to-child transmission and communal living.”},
“text_only”: True,
“answer”: [“d”]

The prompt template used for the evaluation follows the Alpaca format,³² where “problem_text” is incorporated in {instruction} and “choices” is incorporated in {input}:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{Instruction}
### Input:
{Input}
### Response:

For evaluation in our experiments, these prompts were given in Japanese for OpenCALM-7B and in English for Llama2-70B. Decoding parameters can be specified when generating the responses. In our experiments, temperature was set to 0.1, max_new_tokens to 256, top_p to 0.9, and repetition_penalty to 1.05. Question-answering samples that yielded null responses were excluded from the dataset.

Finally, we evaluated the output responses of each model by three different metrics: Exact match, Gestalt score, and Accuracy. While all these metrics aim to assess how effectively models can select the correct choice from
Volume 1 Issue 2 (2024) 110 doi: 10.36922/aih.2695
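The gradient accumulation mentioned in the fine-tuning setup on this page (a batch of 8 processed as two mini-batches of 4) can be illustrated with a minimal sketch. It is not the authors' training code: the one-parameter linear model, the data values, and the function names `grad_mse` and `accumulated_grad` are all hypothetical, chosen only to show that averaging mini-batch gradients reproduces the full-batch gradient for a mean-reduced loss.

```python
# Pure-Python sketch (illustrative assumptions only): for a mean-reduced loss,
# accumulating gradients over equal-sized mini-batches gives the same gradient
# as processing the whole batch at once, while holding fewer samples in memory.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the mini-batch gradients before a single optimizer step."""
    n_steps = len(xs) // micro_batch_size  # assumes the batch divides evenly
    total = 0.0
    for i in range(n_steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each mini-batch gradient by 1 / n_steps so the sum is a mean.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_steps
    return total


# Hypothetical data: one batch of 8 samples.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 1.5

full = grad_mse(w, xs, ys)                               # batch of 8 at once
accum = accumulated_grad(w, xs, ys, micro_batch_size=4)  # two passes of 4
print(abs(full - accum) < 1e-9)
```

The two gradients agree up to floating-point rounding, which is why the effective batch size can stay at 8 while only 4 samples are held in memory per pass.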

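The mapping from a dataset record to the Alpaca-format evaluation prompt described on this page can be sketched as follows. The source states only that “problem_text” fills {instruction} and “choices” fills {input}; the way choices are serialized here (one `key: text` pair per line), the blank lines between template sections, and the names `ALPACA_TEMPLATE`, `format_choices`, and `build_prompt` are assumptions made for illustration.

```python
# Sketch of building an Alpaca-format prompt from one QA record.
# Assumptions: choice serialization and blank-line layout are not specified
# in the source; all identifiers below are hypothetical.

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def format_choices(choices):
    """Serialize the choices dict as one 'key: text' line per option (assumed format)."""
    return "\n".join(f"{key}: {text}" for key, text in choices.items())


def build_prompt(sample):
    """Place problem_text in {instruction} and the serialized choices in {input}."""
    return ALPACA_TEMPLATE.format(
        instruction=sample["problem_text"],
        input=format_choices(sample["choices"]),
    )


# The IgakuQA example quoted on this page, in its English rendering.
sample = {
    "problem_id": "116A1",
    "problem_text": "Which of the following is incorrect regarding "
                    "hypertension caused by obstructive sleep apnea?",
    "choices": {
        "a": "It often leads to nocturnal hypertension.",
        "b": "Weight reduction is recommended for obese patients.",
        "c": "Alpha-blockers are the first-line choice of medication.",
        "d": "Morning hypertension is frequently observed in home blood "
             "pressure measurements.",
        "e": "Continuous positive airway pressure (CPAP) therapy is "
             "expected to lower blood pressure.",
    },
    "text_only": True,
    "answer": ["c"],
}

prompt = build_prompt(sample)
print(prompt)
```

Keeping the template as a single constant makes it straightforward to swap in a Japanese version of the same template for OpenCALM-7B.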

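The decoding settings and the exclusion of null responses reported on this page could be organized roughly as below. The four parameter names and values come from the text; they match keyword arguments accepted by the Hugging Face transformers `generate` method, although the source does not state which library was used. What counts as a "null" response is not defined in the source, so treating `None`, empty, and whitespace-only strings as null is an assumption, as are the names `GENERATION_KWARGS` and `drop_null_responses`.

```python
# Sketch of the reported decoding settings plus a null-response filter.
# Assumption: a response is "null" if it is None, empty, or whitespace only.

GENERATION_KWARGS = {
    "temperature": 0.1,
    "max_new_tokens": 256,
    "top_p": 0.9,
    "repetition_penalty": 1.05,
}


def drop_null_responses(records):
    """Keep only records whose 'response' is a non-empty string after stripping."""
    return [
        r for r in records
        if isinstance(r.get("response"), str) and r["response"].strip()
    ]


# Hypothetical model outputs for four questions.
records = [
    {"problem_id": "q1", "response": "c"},
    {"problem_id": "q2", "response": ""},
    {"problem_id": "q3", "response": None},
    {"problem_id": "q4", "response": "   "},
]

kept = drop_null_responses(records)
print([r["problem_id"] for r in kept])
```

Because excluded samples shrink the evaluation set, it is worth recording how many records were dropped per model when comparing scores.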