
Artificial Intelligence in Health                                Medical instruction-tuning for Japanese LLMs




Table 5. Some example outputs from Llama2-70B (0.9k steps of QLoRA)

Correct choice | Model's output | Exact match | Gestalt score | Accuracy
Workers' accident compensation insurance (労働者災害補償保険) | Similar, but with a tokenization issue (労働者補償保険) | 0 | 0.538 | 1
Functional menstrual disorders (機能性月経困難症) | Correct except for the third character (機能的月経困難症) | 0 | 0.736 | 1
Intra-articular hemorrhage (関節内出血) | Correct except for the second character (関篾内出血) | 0 | 0.800 | 1
Monthly standard compensation for patient households (患者世帯の標準報酬月額) | Similar, but with a tokenization issue (患者世帯標準報酬月額) | 0 | 0.666 | 1
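To illustrate how the two string metrics in Table 5 behave on such near-miss answers, the following sketch scores the Japanese answer pairs with Python's `difflib`, whose `SequenceMatcher.ratio()` implements the Gestalt (Ratcliff/Obershelp) similarity. The paper's exact scoring pipeline is not specified here, so the values below need not reproduce the Gestalt scores reported in the table; the point is only that exact match collapses to 0 while the similarity ratio stays high.

```python
# Sketch: exact match vs. Gestalt similarity on near-miss answers.
# Uses difflib.SequenceMatcher, whose ratio() is 2*M / (len(a) + len(b)),
# with M the total length of matching blocks (Ratcliff/Obershelp matching).
from difflib import SequenceMatcher

pairs = [
    ("労働者災害補償保険", "労働者補償保険"),  # tokenization issue: characters dropped
    ("機能性月経困難症", "機能的月経困難症"),  # one wrong character
    ("関節内出血", "関篾内出血"),              # one wrong character
]

for gold, pred in pairs:
    exact = int(gold == pred)                            # all-or-nothing score
    gestalt = SequenceMatcher(None, gold, pred).ratio()  # graded similarity
    print(f"exact={exact}  gestalt={gestalt:.3f}  ({pred})")
```

All three pairs score 0 on exact match yet at least 0.8 on the Gestalt ratio, matching the table's qualitative pattern even though the absolute values differ from the reported scores.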


Table 6. Example responses from each model

Model name | LoRA/QLoRA | Response (exact output from LLM)
OpenCALM-7B | None | At the bedside, the nurse encouraged the patient to move their legs to assist in getting out of bed. The patient attempted to flex and extend their legs; however, they were unable to move the leg joints and needed assistance to stand up independently. The nurse approached the patient sitting on a bedside chair, provided shoulder support, lifted their arms, extended their elbows, and helped them rise slowly. The nurse then gently released their grip and returned to the chair. The nurse proceeded to call the next patient.*
OpenCALM-7B | 1k steps of LoRA | The procedure to be carried out for patients with the following disease involves the implementation of appropriate treatment and preventive measures.*
OpenCALM-7B | 3k steps of LoRA | Detailed treatment methods have not been provided.*
Llama2-70B | None | In some cases, a procedure called thrombol.
Llama2-70B | 0.9k steps of QLoRA | The treatment of deep vein thrombosis involves the use of thrombolytic agents and anticoagulants.*
Llama2-70B | 3k steps of QLoRA | The treatment for deep vein thrombosis includes pharmacological therapy to dissolve the blood clot as well as surgical interventions to remove the thrombus.*

Note: *Originally in Japanese.
Abbreviation: LLMs: Large language models.

responses provided within the context, whereas Llama2-70B retains this ability even after instruction-tuning.

5.3. Evaluation metrics

There has been intense debate recently surrounding the evaluation of LLMs, and no unified "rule-of-thumb" method has yet emerged. While existing metrics (e.g., JGLUE35) or leaderboards (e.g., the Nejumi LLM leaderboard, http://wandb.me/nejumi) can assess the fluency of generated texts, they do not adequately evaluate the accuracy of domain-specific knowledge. It is noteworthy that the three metrics used in our experiments also exhibit certain shortcomings. For example, exact match cannot accurately score responses that convey the correct meaning but do not match the reference text verbatim. The Gestalt score is asymmetric and prone to error when multiple choices are involved. Overall, our string-based metrics fall short in identifying phrases that use different expressions but convey the same meaning, and in reflecting aspects such as fluency and medical accuracy. We argue that these shortcomings are not problematic in question-answering tasks where the model is required to output one or a few choices as short texts, but they become problematic when evaluating LLMs on practical tasks, including medical report generation, where these aspects are crucial.

Furthermore, even the use of multiple-choice questions for evaluating LLMs has been controversial.36,37 The development of better evaluation metrics is eagerly anticipated.

5.4. Difficulty and limitations

While numerous LLM training techniques are still at a developmental stage, several shortcomings of training medical LLMs, such as the work presented here, should be highlighted. First and foremost, the quantity and quality of the data may have been insufficient in our work. Preparing a medical dataset in instructional format can be expensive. In this study, we employed ChatGPT for automated generation, but this approach may become financially burdensome when preparing larger datasets. Data cleansing has also consistently posed challenges, and achieving perfect results in this work may not have been feasible.

Moreover, during the writing of this paper, Japanese LLMs considered to perform better than OpenCALM-7B, which was used in this study, were released (see, e.g., the Rakuda benchmark, https://yuzuai.jp/benchmark). Using them as the base model might yield different results. Since one general


            Volume 1 Issue 2 (2024)                        113                               doi: 10.36922/aih.2695