Artificial Intelligence in Health Medical instruction-tuning for Japanese LLMs

five alternatives, they are defined with slight variations. Let R denote the response string and C* denote the correct answer string among the five choices. Exact match takes the value 1 if R and C* match exactly at the string level, and 0 otherwise. Gestalt score is defined as the Gestalt distance between the response and the correct answer, which is calculated by a string matching algorithm based on the longest common subsequence: let K denote the longest matched string; then Gestalt score is calculated as GestaltScore(R) = 2|K|/(|R|+|C*|). Finally, Accuracy reflects correctness by checking whether the choice closest to the model’s response, as measured by Gestalt distance, is the correct one. The definitions are summarized as follows:

S = {C_1, C_2, C_3, C_4, C_5}: the choices,
C* (∈ S): the correct choice,
R: the response of the model,
ExactMatch(R) = 1 if R = C* else 0,
GestaltDistance(R, C) = 2|K|/(|R|+|C|), where K is the longest matched string between R and C,
GestaltScore(R) = GestaltDistance(R, C*),
Accuracy(R) = 1 if argmax_{C ∈ S} GestaltDistance(R, C) = C* else 0.

All of the above metrics take values between 0 and 1, and a larger value indicates better model performance.

3.4. Experimental settings

The whole dataset used in this work is summarized in Table 2. The experiments were run on four NVIDIA A100 GPUs with 80 GB of memory each. All code was implemented in Python, and the software and libraries we used include Transformers [33] and PEFT [26] from Hugging Face.

4. Results

4.1. The effect of medical instruction-tuning

The average scores of the experiments for both 0-shot and 1-shot inference, measured by Exact match, Gestalt score, and Accuracy, are summarized in Table 3 and Figure 2. The 0-shot inference refers to generating responses without any examples, while the 1-shot inference refers to the setting in which one question-answer example is included in the input prompt. In Table 3, the top two scores in each row are highlighted in bold.

4.2. Comparison of our string-based evaluation metrics

Evaluation of LLMs is mainly conducted via manual evaluation and rule-based automated evaluation [1]. Among automated evaluation methods, likelihood-based evaluation is predominant [34]. However, this method assesses the vectors output by the model rather than the actual generated strings, making it unsuitable for comparison with ChatGPT. To address this issue, our evaluation metrics are based on the strings actually output by the model. Exact match is a strict criterion under which a response is considered correct only if it matches the correct answer precisely. Consequently, the number of correct answers is lower, because even slight deviations are not accepted. On the other hand, Accuracy is a relatively lenient metric under which an output is considered correct as long as it is more similar to the correct answer than to any other choice, even if it is not an exact match. This leads to a relatively higher number of correct answers compared with Exact match, as deviations are tolerated to some extent.

Table 4 is a contingency table showing the number of question-and-answer (Q&A) samples for which the model produced the correct answer. It shows that 112 Q&A samples are considered correct in terms of Accuracy but wrong in terms of Exact match, whereas the reverse never occurs. Among these 112 samples, many responses that appear correct on inspection were not counted as correct by Exact match. This was due to issues such as the model’s output being corrupted by token omissions in the tokenizer, or partial misrepresentation of Japanese characters, as observed in the examples listed in Table 5. This result implies that Accuracy is more suitable than Exact match for evaluating question-answering performance, as it is more robust against issues that models may encounter. Further discussion in this regard is given in section 5.3.

4.3. Example responses from each model

We randomly created questions that ask each model the treatment of a symptom. This type of medical question is

Table 2. Datasets used in this work
| Name | Source type | Format type | Purpose | Number of records |
|---|---|---|---|---|
| The Japanese Circulation Society | Academic journal | Alpaca format [32] | Instruction-tuning | 21365 |
| The Journal of the Japanese Society of Internal Medicine | Academic journal | Alpaca format [32] | Instruction-tuning | 56057 |
| IgakuQA [31] | Medical license exam | 5-choice question | Evaluation | 2002 |
| JJSIMQA | Review questions | 5-choice question | Evaluation | 460 |
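
The three string-based metrics defined on this page (Exact match, Gestalt score, and Accuracy) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that Python's standard `difflib.SequenceMatcher`, whose `ratio()` returns 2M/T for M matched characters and T total characters, is an acceptable stand-in for GestaltDistance. Note that `difflib` counts all characters in the recursively found matching blocks, which may differ from |K| as defined in the text. The function names and the English drug-name choices are hypothetical and chosen only for readability.

```python
from difflib import SequenceMatcher
from typing import Sequence


def gestalt_distance(response: str, choice: str) -> float:
    """Similarity in [0, 1], computed as 2*M / (|R| + |C|) by difflib."""
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long sequences.
    return SequenceMatcher(None, response, choice, autojunk=False).ratio()


def exact_match(response: str, correct: str) -> int:
    """1 if the response equals the correct choice at the string level."""
    return 1 if response == correct else 0


def gestalt_score(response: str, correct: str) -> float:
    """Gestalt distance between the response and the correct choice."""
    return gestalt_distance(response, correct)


def accuracy(response: str, choices: Sequence[str], correct: str) -> int:
    """1 if the choice most similar to the response is the correct one."""
    closest = max(choices, key=lambda c: gestalt_distance(response, c))
    return 1 if closest == correct else 0


if __name__ == "__main__":
    # Hypothetical example: the response drops one character of the answer.
    choices = ["aspirin", "warfarin", "heparin", "insulin", "digoxin"]
    correct = "aspirin"
    response = "aspirn"
    print(exact_match(response, correct))
    print(gestalt_score(response, correct))
    print(accuracy(response, choices, correct))
```

In this hypothetical case the response is not an exact match, yet it remains closer to the correct choice than to any other, which is the kind of situation in which Accuracy and Exact match disagree.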

Volume 1 Issue 2 (2024) 111 doi: 10.36922/aih.2695