
Artificial Intelligence in Health                                Medical instruction-tuning for Japanese LLMs




            Table 1. LoRA/QLoRA parameters

                                       OpenCALM-7B          Llama2-70B
            Fine-tuning method         LoRA                 QLoRA
            Learning rate              5e-5                 2e-4
            Input length               512                  512
            Target max length          512                  512
            Batch size                 8                    8
            Fine-tuning steps          1k, 3k, 10k          0.9k, 3k
            r of (Q)LoRA               8                    64
            α of (Q)LoRA               32                   16
            Dropout rate of (Q)LoRA    0.05                 0.1
            Target parameter           Query, key, value    All linear layers

               as “instruction”: Question content, “output”: Answer content, and do not include line breaks. Repeat this process 15 times and list one data pair per line.
               ### Input: {input_text}

              The number of epochs and steps was set to align with the overall computational time in each experiment. Using a larger model such as Llama2-70B increases the GPU memory usage per sample. Memory usage can be reduced by decreasing the floating-point precision or by using gradient accumulation. In this study, we adopted 4-bit QLoRA on Llama2-70B. Since 4 bits is optimal in terms of the relationship between floating-point precision and model performance,²⁷ it is not desirable to reduce the floating-point precision any further. To experiment with less GPU memory, gradient accumulation was attempted by splitting each batch into smaller mini-batches; for example, a batch size of 8 is computed as two mini-batches of size 4, whose gradients are accumulated before a single parameter update. This approach allows larger models to be trained with fewer computing resources.

            3.3. Evaluation by medical question-answering tasks

            The state-of-the-art performance of English medical LLMs is typically evaluated using benchmark datasets such as MedQA (United States Medical Licensing Examination, USMLE),²⁸ MedMCQA,²⁹ and PubMedQA.³⁰ However, the availability of Japanese-curated medical task datasets is significantly limited, with IgakuQA (Japanese medical licensing exams)³¹ being the only one available at present. Hence, in addition to IgakuQA, we prepared a new Q&A dataset, JJSIMQA, to assess the performance of each model in the medical domain. JJSIMQA is our own dataset comprising 5-choice questions included in JJSIM as appendices. Here are some samples from the IgakuQA and JJSIMQA datasets:

               An example from IgakuQA (originally in Japanese)
               “problem_id”: “116A1”,
               “problem_text”: “Which of the following is incorrect regarding hypertension caused by obstructive sleep apnea?”,
               “choices”: {“a”: “It often leads to nocturnal hypertension.”, “b”: “Weight reduction is recommended for obese patients.”, “c”: “Alpha-blockers are the first-line choice of medication.”, “d”: “Morning hypertension is frequently observed in home blood pressure measurements.”, “e”: “Continuous positive airway pressure (CPAP) therapy is expected to lower blood pressure.”},
               “text_only”: True,
               “answer”: [“c”]

               An example from JJSIMQA, 5-choice questions in JJSIM (originally in Japanese)
               “problem_text”: “Which of the following is incorrect about recent cases of hepatitis B in Japan? Choose one.”,
               “choices”: {“a”: “The HBs antigen positivity rate has significantly decreased due to the initiation of mother-to-child infection prevention programs.”, “b”: “HBV (hepatitis B virus) genotype Ae can become a carrier through horizontal transmission in adults.”, “c”: “In Japan, routine HBV vaccination began in October 2016.”, “d”: “HBV genotype C is more prevalent in the Tohoku and Miyako-Yaeyama regions.”, “e”: “Horizontal transmission of HBV during childhood is thought to be partly attributed to father-to-child transmission and communal living.”},
               “text_only”: True,
               “answer”: [“d”]

              The prompt template used for the evaluation follows the Alpaca format,³² where “problem_text” is incorporated in {instruction} and “choices” is incorporated in {input}:

               Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
               ### Instruction:
               {instruction}
               ### Input:
               {input}
               ### Response:

              For evaluation in our experiments, these prompts were given in Japanese for OpenCALM-7B and in English for Llama2-70B. Generation parameters can be specified when producing the responses. In our experiments, temperature was set to 0.1, max_new_tokens to 256, top_p to 0.9, and repetition_penalty to 1.05. Question-answering samples that yielded null responses were excluded from the dataset.

              Finally, we evaluated the output responses of each model by three different metrics: Exact match, Gestalt score, and Accuracy. While all these metrics aim to assess how effectively models can select the correct choice from
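
The adapter settings in Table 1 can be compared more easily once the rank r and α are combined. The sketch below is illustrative only and is not the authors' training code: it assumes the standard LoRA convention in which the low-rank update is scaled by α / r, and the class name `LoraSettings`, the helper `adapter_params`, and the 4096 x 4096 layer size are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Adapter hyperparameters for one model, copied from Table 1."""

    model: str
    method: str
    learning_rate: float
    r: int          # rank of the low-rank update
    alpha: int      # LoRA alpha
    dropout: float

    @property
    def scaling(self) -> float:
        # Assumption: standard LoRA scaling, update = (alpha / r) * B @ A.
        return self.alpha / self.r


def adapter_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix.

    A has shape r x d_in and B has shape d_out x r.
    """
    return r * (d_in + d_out)


OPENCALM_7B = LoraSettings("OpenCALM-7B", "LoRA", 5e-5, r=8, alpha=32, dropout=0.05)
LLAMA2_70B = LoraSettings("Llama2-70B", "QLoRA", 2e-4, r=64, alpha=16, dropout=0.1)

if __name__ == "__main__":
    for cfg in (OPENCALM_7B, LLAMA2_70B):
        print(f"{cfg.model}: scaling = {cfg.scaling}")
    # Hypothetical 4096 x 4096 projection, chosen only for illustration.
    print(adapter_params(4096, 4096, r=8))
```

Under this convention the OpenCALM-7B setting applies a larger multiplier to a lower-rank update, while the Llama2-70B setting applies a smaller multiplier to a higher-rank one.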

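
The gradient accumulation described in the text (a batch of 8 processed as two mini-batches of 4) can be illustrated with a toy model. This is a minimal sketch, not the training loop used in the study: it assumes a one-parameter linear model with a mean-squared-error loss, and the function names and data values are made up for illustration. It shows that averaging the mini-batch gradients reproduces the full-batch gradient while only four samples are handled at a time.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Average mini-batch gradients so the result matches one full-batch pass.

    Only `micro_batch_size` samples are processed at a time, which is what
    lowers peak memory in a real training loop.
    """
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Dividing by `steps` keeps the accumulated value an average, not a sum.
        total += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


# Toy batch of 8 samples (values are made up for illustration).
XS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
YS = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

if __name__ == "__main__":
    full = mse_gradient(0.5, XS, YS)
    accumulated = accumulated_gradient(0.5, XS, YS, micro_batch_size=4)
    print(full, accumulated)
```

The equality holds exactly only when every mini-batch has the same size, which is why the sketch rejects batch sizes that are not a multiple of the mini-batch size.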

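
The Alpaca-format prompt and the generation settings reported in the text can be assembled as follows. This is a sketch under stated assumptions rather than the authors' evaluation script: the blank lines between template sections, the "a: text" rendering of each choice, and the names `ALPACA_TEMPLATE`, `GENERATION_PARAMS`, `format_choices`, and `build_prompt` are all choices made for this example. Only the template wording, the field names, and the four parameter values come from the text.

```python
# Template wording from the text; the exact whitespace is an assumption.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Generation settings as reported in the text.
GENERATION_PARAMS = {
    "temperature": 0.1,
    "max_new_tokens": 256,
    "top_p": 0.9,
    "repetition_penalty": 1.05,
}


def format_choices(choices):
    """Render the choices dict as one 'key: text' line per option (assumed layout)."""
    return "\n".join(f"{key}: {text}" for key, text in sorted(choices.items()))


def build_prompt(sample):
    """Place problem_text in {instruction} and the choices in {input}."""
    return ALPACA_TEMPLATE.format(
        instruction=sample["problem_text"],
        input=format_choices(sample["choices"]),
    )


# The IgakuQA sample quoted in the text (English translation).
SAMPLE = {
    "problem_id": "116A1",
    "problem_text": (
        "Which of the following is incorrect regarding hypertension "
        "caused by obstructive sleep apnea?"
    ),
    "choices": {
        "a": "It often leads to nocturnal hypertension.",
        "b": "Weight reduction is recommended for obese patients.",
        "c": "Alpha-blockers are the first-line choice of medication.",
        "d": "Morning hypertension is frequently observed in home blood "
             "pressure measurements.",
        "e": "Continuous positive airway pressure (CPAP) therapy is "
             "expected to lower blood pressure.",
    },
    "text_only": True,
    "answer": ["c"],
}

if __name__ == "__main__":
    print(build_prompt(SAMPLE))
```

The keys of `GENERATION_PARAMS` match the parameter names given in the text, so the dictionary could be passed to a text-generation call that accepts those names.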
            Volume 1 Issue 2 (2024)                        110                               doi: 10.36922/aih.2695