Table 3. A running example for the preprocessing steps

Example 1
Original text:
Input: Get the number of patients who died in or before 2132 and were less than 72 years of age.
Output: SELECT COUNT (DISTINCT DEMOGRAPHIC.'SUBJECT_ID') FROM DEMOGRAPHIC WHERE DEMOGRAPHIC.'AGE' < '72' AND DEMOGRAPHIC.'DOD_YEAR' <= '2132.0'
Preprocessed text:
Input: transform: get the number of patients who died in or before 2132 and were less than 72 years of age.
Output: select count (distinct demographic.'subject_id') from demographic where demographic.'age' less than '72' and demographic.'dod_year' less than or equal to '2132.0'
Tokenized text:
Input: ['▁transform', ':', '▁Get', '▁the', '▁number', '▁of', '▁patients', '▁who', '▁died', '▁in', '▁or', '▁before', '▁21', '32', '▁and', '▁were', '▁less', '▁than', '▁72', '▁years', '▁of', '▁age', '.']
Output: ['▁select', '▁count', '▁(', '▁distinct', '▁demographic', '.', "'", 'sub', 'ject', '_', 'i', 'd', "'", '▁', ')', '▁from', '▁demographic', '▁where', '▁demographic', '.', "'", 'age', "'", '▁less', '▁than', "'", '72', "'", '▁and', '▁demographic', '.', "'", 'd', 'o', 'd', '_', 'year', "'", '▁less', '▁than', '▁or', '▁equal', '▁to', '▁', "'", '2', '13', '2.0', "'"]

Example 2
Original text:
Input: calculate the minimum days for which patients aged 20 years or older were hospitalized.
Output: SELECT MIN (DEMOGRAPHIC.'DAYS_STAY') FROM DEMOGRAPHIC WHERE DEMOGRAPHIC.'AGE' >= '20'
Preprocessed text:
Input: transform: calculate the minimum days for which patients aged 20 years or older were hospitalized.
Output: select min (demographic.'days_stay') from demographic where demographic.'age' greater than or equal to '20'
Tokenized text:
Input: ['▁transform', ':', '▁calculate', '▁the', '▁minimum', '▁days', '▁for', '▁which', '▁patients', '▁aged', '▁20', '▁years', '▁or', '▁older', '▁were', '▁hospital', 'ized', '.']
Output: ['▁select', '▁min', '▁(', '▁demographic', '.', "'", 'day', 's', '_', 'stay', "'", '▁', ')', '▁from', '▁demographic', '▁where', '▁demographic', '.', "'", 'age', "'", '▁greater', '▁than', '▁or', '▁equal', '▁to', '▁', "'", '20', "'"]
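To make the transformation in Table 3 concrete, the following minimal sketch reproduces the preprocessing and tokenization steps, assuming the HuggingFace transformers library and the publicly released google/flan-t5-base tokenizer; the helper function and the operator-to-phrase mapping are illustrative, inferred from the table rather than taken from the released implementation.

```python
from transformers import AutoTokenizer

# Operator-to-phrase mapping inferred from Table 3 (the longer symbols must
# be replaced before the shorter ones).
OPERATOR_WORDS = {
    "<=": "less than or equal to",
    ">=": "greater than or equal to",
    "<": "less than",
    ">": "greater than",
}

def preprocess(question: str, sql: str):
    """Prefix the question with 'transform:', lowercase the SQL, and spell
    out comparison operators as words, as shown in the middle column."""
    source = "transform: " + question.strip().lower()
    target = sql.strip().lower()
    for symbol, words in OPERATOR_WORDS.items():
        target = target.replace(symbol, words)
    return source, target

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

source, target = preprocess(
    "Get the number of patients who died in or before 2132 and were less than 72 years of age.",
    "SELECT COUNT (DISTINCT DEMOGRAPHIC.'SUBJECT_ID') FROM DEMOGRAPHIC "
    "WHERE DEMOGRAPHIC.'AGE' < '72' AND DEMOGRAPHIC.'DOD_YEAR' <= '2132.0'",
)
# SentencePiece pieces comparable to the third column of Table 3
print(tokenizer.tokenize(source))
print(tokenizer.tokenize(target))
```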

4.1. Experimental setup
A key objective of this study is to evaluate the practicality of the training and inference processes. For this purpose, we employed a standalone machine equipped with a single Nvidia V100 GPU (16 GB vRAM) and 32 GB of system memory.
In our implementation, we utilized both the Base and Large versions of Flan-T5, which are built on the original T5 encoder-decoder architecture and augmented with an instruction-finetuning mechanism. This architecture consists of multiple layers of transformer blocks, including self-attention mechanisms and feed-forward neural networks. These transformer blocks enable the model to capture long-range dependencies and contextual information from input sequences.
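For concreteness, loading these models can be sketched as follows, assuming the publicly released google/flan-t5-base and google/flan-t5-large checkpoints correspond to the Base and Large variants used here.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed public checkpoints for the Base and Large variants
checkpoint = "google/flan-t5-base"   # or "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Encoder and decoder stacks of transformer blocks (self-attention + feed-forward)
print(model.config.num_layers, model.config.num_decoder_layers, model.config.d_model)
```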
The input and output sequence lengths were standardized to 1024 tokens. Sequences exceeding this length were truncated, while shorter sequences were padded using a pad token. This sequence length configuration enabled a maximum training batch size of two on our GPU setup. Various learning rates and optimizers were tested, ultimately leading to the selection of the Adam optimizer with a learning rate of 5e-5.
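A minimal sketch of this fixed-length encoding, assuming the HuggingFace tokenizer loaded above; the function name is ours, not the released implementation.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

def encode(question: str, sql: str) -> dict:
    """Truncate sequences longer than 1024 tokens and pad shorter ones
    with the pad token, for both the input question and the target SQL."""
    model_inputs = tokenizer(
        question, max_length=1024, truncation=True, padding="max_length"
    )
    labels = tokenizer(
        sql, max_length=1024, truncation=True, padding="max_length"
    )
    # In practice, padded label positions are usually masked before the loss.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```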

The fine-tuning phase for each model spanned five epochs, a decision based on empirical observations of convergence and generalization in preliminary trials. This epoch count offered an optimal balance between model performance and training duration, ensuring the models adequately learned the MIMICSQL dataset patterns without overfitting. In terms of time, completing five epochs of training took approximately 4 h for the Flan-T5 Base version and 7 h for the Flan-T5 Large version. Table 4 provides the details of the tuning scenario.
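The tuning scenario described above can be sketched with the HuggingFace Seq2SeqTrainer as follows; this is an assumed reconstruction, with Transformers' default AdamW optimizer standing in for the Adam optimizer named in the text, and a hypothetical output directory.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Hyperparameters reported above
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-mimicsql",     # hypothetical output path
    per_device_train_batch_size=2,     # the maximum that fits a 16 GB V100
    learning_rate=5e-5,
    num_train_epochs=5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,       # encoded MIMICSQL pairs, assumed prepared as above
)
trainer.train()
```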
While evaluating the other approaches, significant attention was given to the construction of prompts for LLaMA-2-7B, GPT-3.5-Turbo, GPT-4, and DeFog-SQLCoder to effectively generate SQL queries. For these models, prompts were meticulously designed to include schema information, facilitating the generation of accurate SQL queries. These prompts are essential for guiding the models through the task, leveraging their inherent language understanding capabilities. Detailed examples of these prompts are provided in Figure 3. Notably, our approach with the Flan-T5 models is distinct from this conventional method, removing the need for any schema information in the prompts. In this approach, the Flan-T5 models are fine-tuned so that the questions are the only inputs, and the database schema is captured automatically in a more efficient fashion. This distinction underscores the uniqueness and efficiency of our methodology. It is important to mention that the TREQS models, not being categorized as LLMs, did not necessitate such prompt-based approaches, but used the schema as input alongside the question, further differentiating our method from traditional practices.
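The contrast between the two input styles can be sketched as follows; the template wording and schema excerpt are hypothetical stand-ins, since the exact prompts used in the study are those shown in Figure 3.

```python
# Illustrative schema excerpt only, not the full MIMICSQL schema
SCHEMA = "TABLE DEMOGRAPHIC (SUBJECT_ID, AGE, DOD_YEAR, DAYS_STAY, ...)"

def schema_aware_prompt(question: str) -> str:
    """Hypothetical prompt for the schema-aware baselines
    (LLaMA-2-7B, GPT-3.5-Turbo, GPT-4, DeFog-SQLCoder)."""
    return (
        "Given the database schema below, write the SQL query that answers the question.\n\n"
        f"Schema:\n{SCHEMA}\n\n"
        f"Question: {question}\nSQL:"
    )

def schema_less_input(question: str) -> str:
    """The fine-tuned Flan-T5 models receive only the question, no schema."""
    return "transform: " + question.lower()
```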
4.2. Evaluation metric
The most commonly used evaluation metrics for the text-to-SQL task are logical form accuracy (LFA, or exact matching) and execution accuracy.20 Since the MIMICSQL community has not shared the databases needed to compute the execution accuracy metric, the primary metric we used for evaluating the models' performance in this study is LFA. It measures the percentage of exact string matches between the generated SQL queries and the ground truth SQL queries. It is quantified as the percentage of instances where the model's predicted SQL

