


five alternatives, they are defined with slight variations. Let R denote the response string and C* denote the correct answer string among the five choices. Exact match takes the value of 1 if R and C* exactly match at the string level, and 0 otherwise. Gestalt score is defined as the Gestalt distance between the response and the correct answer, which is calculated by a string-matching algorithm based on the longest common subsequence: let K denote the longest matched string; then the Gestalt score is calculated as GestaltScore(R) = 2|K|/(|R| + |C*|). Finally, Accuracy reflects correctness by evaluating whether the choice closest to the model's response, measured using the Gestalt score, is the correct one. The definitions are summarized as follows:

  S = {C1, C2, C3, C4, C5}: the set of choices,
  C* (∈ S): the correct choice,
  R: the response of the model,
  ExactMatch(R) = 1 if R = C*, else 0,
  GestaltDistance(R, C) = 2|K|/(|R| + |C|), where K is the longest matched string between R and C,
  GestaltScore(R) = GestaltDistance(R, C*),
  Accuracy(R) = 1 if argmax_{C ∈ S} GestaltDistance(R, C) = C*, else 0.

  All the evaluation metrics mentioned above take values between 0 and 1, and larger values indicate better model performance.
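  As a concrete illustration of these definitions, the following minimal Python sketch computes the three metrics for a single question. It is not the authors' code: the function names are our own, and Python's difflib.SequenceMatcher is used only because its find_longest_match method returns the longest matched block required by the 2|K|/(|R| + |C|) formula.

    # Illustrative sketch of the string-based metrics defined above (not the authors' code).
    from difflib import SequenceMatcher

    def gestalt_distance(r: str, c: str) -> float:
        """2|K| / (|R| + |C|), with K the longest matched string between R and C."""
        if not r and not c:
            return 1.0
        k = SequenceMatcher(None, r, c).find_longest_match(0, len(r), 0, len(c))
        return 2 * k.size / (len(r) + len(c))

    def exact_match(r: str, c_star: str) -> int:
        return 1 if r == c_star else 0

    def gestalt_score(r: str, c_star: str) -> float:
        return gestalt_distance(r, c_star)

    def accuracy(r: str, choices: list, c_star: str) -> int:
        # Correct if the choice most similar to the response is the correct choice.
        closest = max(choices, key=lambda c: gestalt_distance(r, c))
        return 1 if closest == c_star else 0

  Note that SequenceMatcher.ratio() sums all matching blocks rather than taking only the longest one, so ratio() is not interchangeable with the definition above.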
3.4. Experimental settings

  The whole dataset used in this work is summarized in Table 2. The experiments were run on 4 NVIDIA A100 GPUs with 80 GB of memory each. All code was implemented in Python, and the software and libraries we used include Transformers [33] and PEFT [26] from Hugging Face.
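  As an illustration of how Transformers and PEFT are typically combined for parameter-efficient instruction-tuning, a minimal sketch is shown below. The choice of LoRA, the base-model name, and every hyperparameter value are assumptions made for illustration and are not taken from this work.

    # Hypothetical sketch: parameter-efficient fine-tuning with Hugging Face
    # Transformers + PEFT. The model name and hyperparameters are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base_model = "example-org/japanese-llm-7b"  # placeholder, not the model used here
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)

    # Wrap the base model with LoRA adapters so that only a small fraction of
    # the parameters is updated during instruction-tuning.
    lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()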
4. Results

4.1. The effect of medical instruction-tuning

  The average scores of the experiments conducted with both 0-shot inference and 1-shot inference, measured by Exact match, Gestalt score, and Accuracy, are summarized in Table 3 and Figure 2. 0-shot inference refers to making responses without any specific examples, while 1-shot inference refers to when one question-answer example is included in the input prompt. In Table 3, the top 2 scores in each row are highlighted in bold.
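  To make the 0-shot/1-shot distinction concrete, the sketch below shows one way such prompts could be assembled for a five-choice question. The template wording and field names are hypothetical and do not reproduce the exact prompts used in this work.

    # Hypothetical sketch of 0-shot vs. 1-shot prompt construction for a
    # five-choice question. The template wording is illustrative only.
    from typing import Optional

    def build_prompt(question: str, choices: list, example: Optional[dict] = None) -> str:
        """Return a 0-shot prompt, or a 1-shot prompt if one worked example is given."""
        parts = []
        if example is not None:  # 1-shot: prepend one question-answer example
            parts.append("Question: " + example["question"])
            parts.append("Choices: " + " / ".join(example["choices"]))
            parts.append("Answer: " + example["answer"])
        parts.append("Question: " + question)
        parts.append("Choices: " + " / ".join(choices))
        parts.append("Answer:")
        return "\n".join(parts)

    # 0-shot: build_prompt(q, choices)
    # 1-shot: build_prompt(q, choices, example={"question": "...", "choices": [...], "answer": "..."})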

4.2. Comparison of our string-based evaluation metrics

  Evaluation of LLMs is mainly conducted via manual evaluation [1] and automated evaluation based on rules. Among automated evaluation methods, likelihood-based evaluation [34] is predominant. However, this evaluation method assesses the vectors outputted by the model rather than the actual generated strings, making it unsuitable for comparison with ChatGPT. To address this issue, our evaluation metrics are based on the strings actually outputted by the model. Exact match is a strict criterion under which a response is considered correct only if it matches the correct answer precisely. Consequently, the number of correct answers is lower, because even slight deviations are not accepted. On the other hand, Accuracy is a relatively lenient metric under which an output is considered correct as long as it is similar to the correct answer, even if it is not an exact match. This leads to a relatively higher number of correct answers compared to Exact match, as deviations are tolerated to some extent.

  Table 4 is a contingency table showing the number of question-and-answer (Q&A) samples for which the model produced the correct answer. In total, 112 Q&A samples are considered correct in terms of Accuracy but wrong under Exact match, whereas the reverse does not occur. Among these 112 samples, many responses that appeared to be correct were not counted as correct under the Exact match evaluation. This was due to issues such as the model's output being corrupted by token omissions in the tokenizer, or partial misrepresentation of Japanese characters, as observed in the examples listed in Table 5. This result implies that Accuracy is more suitable than Exact match for evaluating question-answering performance, as it is more robust against the issues that models may encounter. Further discussion in this regard is given in section 5.3.
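  As a sketch of how such a contingency table can be tallied from per-sample results, the snippet below counts the four agreement cells. The input lists of 0/1 flags are assumed to come from metric functions like those sketched above; this is not the authors' analysis code.

    # Hypothetical sketch: 2x2 contingency between Exact match and Accuracy,
    # given per-sample 0/1 flags (one entry per Q&A sample).
    from collections import Counter

    def contingency(exact_flags, accuracy_flags):
        """Count Q&A samples in each (Exact match, Accuracy) cell."""
        cells = Counter(zip(exact_flags, accuracy_flags))
        return {
            "both correct": cells[(1, 1)],
            "Accuracy only": cells[(0, 1)],     # e.g., the 112 samples reported in Table 4
            "Exact match only": cells[(1, 0)],  # 0 by construction of the two metrics
            "both wrong": cells[(0, 0)],
        }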
4.3. Example responses from each model

  We randomly created questions that ask each model about the treatment of a symptom. This type of medical question is

Table 2. Datasets used in this work
Name                                                      Source type           Format type         Purpose             Number of records
The Japanese Circulation Society                          Academic journal      Alpaca format [32]  Instruction-tuning  21365
The Journal of the Japanese Society of Internal Medicine  Academic journal      Alpaca format [32]  Instruction-tuning  56057
IgakuQA [31]                                              Medical license exam  5-choice question   Evaluation          2002
JJSIMQA                                                   Review questions      5-choice question   Evaluation          460

