



knowledge, there has been no research conducted to deepen the medical specialization of Japanese-centric models.

3. Data and methods

We conducted a comprehensive comparison between different LLMs fine-tuned with a Japanese medical dataset, including those we have created ourselves. To determine whether one should start from a smaller Japanese model or a larger English model, we prepared OpenCALM-7B and Llama2-70B as base models. In addition, to observe the effectiveness of pretraining, we introduced a model additionally trained on medical documents. Subsequently, we applied medical instruction-tuning (LoRA, QLoRA) to each of them and evaluated performance based on the accuracy of medical question-answering tasks. The entire procedure is outlined in Figure 1. The models trained and used in our experiments are available at https://huggingface.co/AIgroup-CVM-utokyohospital.

3.1. Base model preparation

To create a Japanese-centric model, we utilized OpenCALM-7B (https://huggingface.co/cyberagent/open-calm-7b), an open-source Japanese foundation LLM with 6.5 billion parameters developed by CyberAgent, Inc. In addition, we trained a new base model, MedCALM, which is based on OpenCALM-7B and continually pretrained on our own medical text dataset. Here, the training dataset consists of 2420 examples, and the evaluation dataset has 50 examples. The maximum token count is set to 768, and the batch size is set to 63. The model was trained for 2000 steps.
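A minimal sketch of this continual pretraining step, assuming the Hugging Face Trainer API and hypothetical corpus files (the actual training script is not reproduced here; device layout and any settings not listed above are placeholders), is:

# Hedged sketch of the MedCALM continual pretraining described above.
# File names and unspecified settings are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "cyberagent/open-calm-7b"
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padded batches
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical medical corpus: 2420 training and 50 evaluation examples.
raw = load_dataset("json", data_files={"train": "medical_train.jsonl",
                                       "eval": "medical_eval.jsonl"})

def tokenize(batch):
    # Maximum token count of 768, as stated above.
    return tokenizer(batch["text"], truncation=True, max_length=768)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="medcalm",
    max_steps=2000,                  # trained for 2000 steps
    per_device_train_batch_size=63,  # reported batch size; device split assumed
    evaluation_strategy="steps",
    eval_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    # Causal (next-token) objective, i.e., continued pretraining.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()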
On the other hand, we further used Llama2-70B-chat-hf (https://huggingface.co/meta-Llama/Llama-2-70b-chat-hf), a powerful English-centric LLM released by Meta Inc.24 Hereinafter, it is referred to as Llama2-70B. The use of this model is governed by the Meta license (https://ai.meta.com/resources/models-and-libraries/llama-downloads/).

3.2. Medical instruction-tuning

Instruction-tuning refers to the process of fine-tuning or optimizing the behavior and output of the model by providing explicit instructions or guidance as a prompt during the generation of text.25 We employed LoRA, one of the popular parameter-efficient fine-tuning methods provided in the PEFT library,7,26 since full fine-tuning, which retrains all model parameters, is unfeasible in our environment. LoRA freezes the pretrained model weights and inserts trainable rank decomposition matrices into each layer of the target model to reduce the number of trainable parameters for downstream tasks. Specifically, instead of directly updating the d × k parameter matrix of a linear layer in the LLM from W₀ to W₀ + ΔW, LoRA updates a d × r matrix B and an r × k matrix A, where BA is a low-rank decomposition of ΔW, that is, r ≪ min(d, k).
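As a concrete illustration of this update rule (illustration only; in our experiments the adapters are injected by the PEFT library rather than hand-written), a single LoRA-adapted linear layer can be sketched as:

# Minimal sketch of the LoRA update y = (W0 + (alpha/r) * B @ A) x for one layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, k: int, r: int = 8, alpha: int = 32):
        super().__init__()
        assert r < min(d, k), "LoRA assumes a low rank: r << min(d, k)"
        self.base = nn.Linear(k, d, bias=False)          # frozen W0, shape d x k
        self.base.weight.requires_grad = False
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, zero-initialized
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k, small random init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only A and B receive gradients; W0 stays fixed.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

For example, for a 4096 × 4096 projection with r = 8, B and A together hold 4096 × 8 + 8 × 4096 = 65,536 trainable weights, compared with roughly 16.8 million in W₀; these dimensions are illustrative, not those of OpenCALM-7B or Llama2-70B.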
Given our computational constraints, particularly the limited GPU memory, LoRA for OpenCALM-7B is feasible, but not for Llama2-70B. Instead, we opted for the quantized version, named QLoRA,8 which is intended to trade off a slight performance drop for a significant reduction in model size, making the experiment using Llama2-70B feasible. Consequently, we applied LoRA to OpenCALM-7B and QLoRA to Llama2-70B, respectively. The hyperparameters of LoRA/QLoRA are listed in Table 1, which follow the default settings specified in the PEFT library and the QLoRA library, respectively.8,26
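How the adapters are attached can be sketched roughly as follows, assuming the PEFT and bitsandbytes libraries; the LoraConfig values shown are common defaults used as placeholders, not a reproduction of Table 1:

# Rough sketch: LoRA on OpenCALM-7B and QLoRA (4-bit base model) on Llama2-70B.
# Hyperparameter values are illustrative placeholders; see Table 1 for ours.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

lora_cfg = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                      task_type="CAUSAL_LM")

# (i) LoRA: the 16-bit OpenCALM-7B fits in GPU memory, so no quantization.
calm = AutoModelForCausalLM.from_pretrained("cyberagent/open-calm-7b",
                                            torch_dtype=torch.float16)
calm = get_peft_model(calm, lora_cfg)
calm.print_trainable_parameters()

# (ii) QLoRA: load Llama2-70B with 4-bit quantized weights, then add adapters.
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
llama = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-chat-hf",
                                             quantization_config=bnb_cfg,
                                             device_map="auto")
llama = prepare_model_for_kbit_training(llama)
llama = get_peft_model(llama, lora_cfg)
llama.print_trainable_parameters()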
To perform medical instruction-tuning, we constructed a medical question-answer dataset containing 77422 records in instruction format. Initially, we reviewed two medical articles, one from the official journal of The Japanese Circulation Society (containing 3569 lines) and another from the Journal of the Japanese Society of Internal Medicine (JJSIM, containing 6120 lines), for input retrieval. Then, these texts were used as inputs for ChatGPT (gpt-3.5-turbo) to generate various question-answer pairs, resulting in 21365 records and 56057 records, respectively. Since ChatGPT is known to possess strong instruction-following ability, we utilized the following prompt template to construct an instruction dataset with overall good quality:

    ### Instructions: You are a machine designed to generate various question and answer pairs. Please create data with question (instruction) and answer (output) pairs based on the following input, considering it as prior knowledge. Format the data













Figure 1. Overview of the procedure of our medical instruction-tuning. Image created with Adobe Illustrator.

