Artificial Intelligence in Health Medical instruction-tuning for Japanese LLMs
Table 5. Some example outputs from Llama2-70B (0.9k steps of QLoRA)

| Correct choice | Model's output | Exact match | Gestalt score | Accuracy |
|---|---|---|---|---|
| Workers' accident compensation insurance (労働者災害補償保険) | Similar, but with a tokenizing issue (労働者補償保険) | 0 | 0.538 | 1 |
| Functional menstrual disorders (機能性月経困難症) | Correct except the third letter (機能的月経困難症) | 0 | 0.736 | 1 |
| Intra-articular hemorrhage (関節内出血) | Correct except the second letter (関篾内出血) | 0 | 0.800 | 1 |
| Monthly standard compensation for patient households (患者世帯の標準報酬月額) | Similar, but with a tokenizing issue (患者世帯標準報酬月額) | 0 | 0.666 | 1 |
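The Exact match and Gestalt score columns above are purely string-level comparisons. As a minimal sketch, assuming a difflib-style Gestalt pattern matching (Ratcliff/Obershelp) implementation — the paper does not specify its exact library, so scores need not match the table for every row — the two metrics can be computed as:

```python
from difflib import SequenceMatcher

def exact_match(reference: str, prediction: str) -> int:
    # 1 only if the strings are identical character-for-character
    return int(reference == prediction)

def gestalt_score(reference: str, prediction: str) -> float:
    # Gestalt pattern matching as implemented by Python's difflib;
    # note that ratio() is not guaranteed to be symmetric in its arguments.
    return SequenceMatcher(None, reference, prediction).ratio()

# Third row of Table 5: a single character differs (節 vs. 篾)
ref, pred = "関節内出血", "関篾内出血"
print(exact_match(ref, pred))    # 0
print(gestalt_score(ref, pred))  # 0.8
```

For this row, 4 of the 5 characters match, giving 2 x 4 / (5 + 5) = 0.8 — consistent with the table — even though the exact-match score is 0.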
Table 6. Example responses from each model

| Model name | LoRA/QLoRA | Response (exact output from LLM) |
|---|---|---|
| OpenCALM-7B | None | At the bedside, the nurse encouraged the patient to move their legs to assist in getting out of bed. The patient attempted to flex and extend their legs; however, they were unable to move the leg joints and needed assistance to stand up independently. The nurse approached the patient sitting on a bedside chair, provided shoulder support, lifted their arms, extended their elbows, and helped them rise slowly. The nurse then gently released their grip and returned to the chair. The nurse proceeded to call the next patient.* |
| | 1k steps of LoRA | The procedure to be carried out for patients with the following disease involves the implementation of appropriate treatment and preventive measures.* |
| | 3k steps of LoRA | Detailed treatment methods have not been provided.* |
| Llama2-70B | None | In some cases, a procedure called thrombol. |
| | 0.9k steps of QLoRA | The treatment of deep vein thrombosis involves the use of thrombolytic agents and anticoagulants.* |
| | 3k steps of QLoRA | The treatment for deep vein thrombosis includes pharmacological therapy to dissolve the blood clot as well as surgical interventions to remove the thrombus.* |

Note: *Originally in Japanese.
Abbreviation: LLMs: Large language models.

responses provided within the context, whereas Llama2-70B retains this ability even after instruction-tuning.

5.3. Evaluation metrics
There have recently been intensive arguments surrounding the evaluation of LLMs, and there is still no unified "rule-of-thumb" method. While the existing metrics (e.g., JGLUE35) or leaderboards (e.g., Nejumi LLM leaderboard, http://wandb.me/nejumi) can assess the fluency of generated texts, they do not adequately evaluate the accuracy of domain-specific knowledge. It is noteworthy that the three metrics used in our experiments also exhibit certain shortcomings. For example, Exact match cannot accurately score responses that, while conveying the correct meaning, do not match the reference text verbatim. The Gestalt score is asymmetric and prone to issues with multiple choices. Overall, our string-based metrics fall short of identifying phrases that use different expressions but convey the same meaning, and of reflecting aspects such as fluency and medical accuracy. We argue that these features are not problematic in question-answering tasks where the model is required to output one or a few choices as short texts, but they become problematic when evaluating LLMs on practical tasks, including medical report generation, where these aspects are crucial.

Furthermore, even the use of multiple-choice questions for evaluating LLMs has been controversial.36,37 The development of better evaluation metrics is eagerly anticipated.

5.4. Difficulty and limitations
While numerous LLM training techniques are still in the developmental stage, several shortcomings of training medical LLMs, as we have done in this work, should be highlighted. First and foremost, the quantity and quality of data could be insufficient in our work. Preparing a medical dataset in instructional format can be expensive. In this study, we employed ChatGPT for automated generation, but this approach may become financially burdensome when preparing larger datasets. Data cleansing has also consistently posed challenges, and achieving perfect results in this work may not have been feasible.

Moreover, during the writing phase of this paper, Japanese LLMs that are considered to perform better than OpenCALM-7B, which was used in this study, have been released (see, e.g., Rakuda benchmark, https://yuzuai.jp/benchmark). There is a possibility of obtaining different results when using them as the base model. Since one general
Volume 1 Issue 2 (2024) 113 doi: 10.36922/aih.2695

