performance in the text-to-SQL domain.14-16 However, these fine-tuning methods typically require an extensive amount of labeled training data tailored to the specific task, and they are often susceptible to over-fitting. This limitation raises concerns about their versatility and efficiency in practical applications.

The advent of LLMs such as GPT has opened new avenues in text-to-SQL tasks, particularly due to their ICL capabilities. These models often outperform fine-tuning methods in various downstream NLP tasks, especially in scenarios requiring few-shot or zero-shot learning. Nevertheless, the effectiveness of LLMs relies heavily on the design of input prompts, a factor that significantly influences output quality.17-19 The ICL performance of LLMs in text-to-SQL tasks, and in particular the impact of different prompts, has also been examined.

While basic prompting serves as a benchmark for assessing the fundamental capabilities of LLMs, more sophisticated prompt designs have been shown to significantly enhance performance. Notably, a few-shot learning approach employing GPT-4 recently set a new benchmark in text-to-SQL tasks, achieving state-of-the-art results. However, this method necessitates manual input for demonstrations and tends to consume a large number of tokens, requiring more time and resources.

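To make the ICL setup concrete, the following is a minimal sketch of how demonstration question-SQL pairs can be concatenated into a few-shot prompt; the instruction wording, demonstration pairs, and table and column names are hypothetical and are not the prompts used in the cited GPT-4 benchmark. Each added demonstration lengthens the prompt, which is the token and latency cost noted above.

```python
# A sketch of a few-shot (in-context learning) prompt for text-to-SQL; the instruction
# wording, demonstration pairs, and table/column names are hypothetical and are not
# the prompts used in the cited GPT-4 benchmark.
def build_few_shot_prompt(question, demonstrations):
    """Concatenate demonstration question-SQL pairs ahead of the target question."""
    parts = ["Translate the natural-language question into a SQL query."]
    for demo_question, demo_sql in demonstrations:
        parts.append(f"Question: {demo_question}\nSQL: {demo_sql}")
    parts.append(f"Question: {question}\nSQL:")
    return "\n\n".join(parts)

demos = [
    ("how many patients are recorded in the database?",
     "SELECT COUNT(DISTINCT subject_id) FROM demographic"),
    ("list the procedure names performed on patient 2560.",
     "SELECT short_title FROM procedures WHERE subject_id = 2560"),
]
prompt = build_few_shot_prompt("what is the average age of deceased patients?", demos)
print(prompt)  # This string would then be submitted to an instruction-following LLM.
```
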
This study extends the current advancements in LLMs within the text-to-SQL domain. Specifically, we fine-tune Flan-T5-based models on the MIMICSQL dataset. Each of these models is a sequence-to-sequence LLM that can also be used commercially. The model family was published by Google researchers in late 2022 and has been fine-tuned on multiple tasks, reframing them into a text-to-text format, such as translation, linguistic acceptability, sentence similarity, and document summarization. The architecture of the Flan-T5 model closely aligns with the encoder-decoder structure used in the original Transformer paper. The primary distinction lies in the size and nature of the training data: Flan-T5 was trained on an extensive 750 GB corpus of text known as the Colossal Clean Crawled Corpus (C4), and it comes in five variations: flan-t5-small (80M parameters, requiring 300 MB in memory), flan-t5-base (250M parameters, requiring 990 MB in memory), flan-t5-large (780M parameters, requiring 1 GB in memory), flan-t5-xl (3B parameters, requiring 12 GB in memory), and flan-t5-xxl (11B parameters, requiring 80 GB in memory). These models can be used for various NLP tasks out of the box (in a zero-shot or few-shot setting); however, to leverage their full potential and ensure optimal performance for specific applications, fine-tuning is a crucial step.

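For reference, the following is a minimal sketch of this out-of-the-box (zero-shot) usage with the Hugging Face transformers library; the prompt wording is illustrative only and does not reflect the evaluation setup of this study.

```python
# A minimal sketch of zero-shot use of a Flan-T5 checkpoint with Hugging Face
# transformers; the prompt wording is illustrative only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # 250M-parameter variant; runs on a CPU workstation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "Translate to SQL: how many patients are older than 65?"
inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
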
Below are the main points and reasons highlighting the choice of fine-tuning FLAN-T5 for this specific text-to-SQL task:
(i) Fine-tuning FLAN-T5 is important for adapting the model to specific tasks and improving its performance on those tasks.
(ii) Fine-tuning allows the model to be customized to better suit the user's needs and data.
(iii) The ability to fine-tune FLAN-T5 on local workstations with CPUs makes it accessible to a wider range of users (see the sketch after this list).
(iv) This accessibility is beneficial for smaller organizations or individual researchers who may not have access to GPU resources.
(v) Overall, fine-tuning FLAN-T5 is a valuable step in optimizing the model for specific use cases and maximizing its potential benefits.

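To make the accessibility argument concrete, the following is a minimal sketch of sequence-to-sequence fine-tuning of Flan-T5 on question-SQL pairs with the Hugging Face transformers Trainer API; the toy training pair and the hyperparameters are placeholders and do not reflect the configuration used in this study.

```python
# A minimal sketch of sequence-to-sequence fine-tuning of Flan-T5 on question-SQL
# pairs with Hugging Face transformers; the toy pair and hyperparameters below are
# placeholders, not the configuration used in this study.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy question-SQL pair; in practice this would be the MIMICSQL training split.
pairs = [("how many female patients are in the database?",
          "SELECT COUNT(DISTINCT subject_id) FROM demographic WHERE gender = 'F'")]

def tokenize(batch):
    # Encode questions as model inputs and SQL queries as target labels.
    enc = tokenizer(batch["question"], truncation=True, max_length=256)
    enc["labels"] = tokenizer(batch["sql"], truncation=True, max_length=256)["input_ids"]
    return enc

train_data = Dataset.from_dict({"question": [q for q, _ in pairs],
                                "sql": [s for _, s in pairs]}).map(tokenize, batched=True)

args = Seq2SeqTrainingArguments(output_dir="flan-t5-text2sql",
                                per_device_train_batch_size=4,
                                learning_rate=5e-5,
                                num_train_epochs=3)
# The Trainer falls back to CPU automatically when no GPU is available.
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_data,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
```
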
Our emphasis on exploring schema-less approaches led us to investigate the viability and advantages of implementing text-to-SQL systems that depend less on explicit knowledge of the database schema.

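To illustrate what "schema-less" means in terms of model input, the sketch below contrasts a schema-aware input, in which a serialized schema accompanies every question, with a schema-less input; the serialization format and column names are assumptions for illustration, not the exact encodings used in this study.

```python
# A hypothetical illustration of schema-aware versus schema-less text-to-SQL inputs;
# the serialization format and column names are assumptions for illustration only.
question = "how many patients were diagnosed with pneumonia?"

# Schema-aware: the database schema is serialized into every model input.
schema = "demographic(subject_id, gender, age) | diagnoses(subject_id, short_title)"
schema_aware_input = f"translate to SQL: {question} | schema: {schema}"

# Schema-less: the model receives only the question and must rely on the table and
# column knowledge it absorbed while being fine-tuned on question-SQL pairs.
schema_less_input = f"translate to SQL: {question}"
```
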
3. Data and methods

This section delineates the comprehensive methodology of our study, encompassing a detailed description of the dataset utilized, the architecture of the model employed, and the specifics of both the training and evaluation processes.

3.1. Dataset

The MIMICSQL dataset is a significant resource for question-to-SQL generation in the healthcare domain, comprising 10,000 question-SQL pairs. This large-scale dataset is based on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, a widely used electronic medical records (EMR) database. It is divided into two subsets: one containing template questions (machine-generated) and the other featuring natural language questions (human-annotated).

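For orientation, the following sketch shows how such question-SQL pairs could be inspected once the dataset has been obtained, assuming the data are stored as one JSON record per line; the file path and field names ("question_refine", "sql") are assumptions and may differ from the actual release.

```python
# A minimal sketch of inspecting MIMICSQL question-SQL pairs; the file path and field
# names ("question_refine", "sql") are assumptions and may differ from the release.
import json

pairs = []
with open("mimicsql_natural/train.json", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        record = json.loads(line)
        pairs.append((record["question_refine"], record["sql"]))

print(f"Loaded {len(pairs)} question-SQL pairs")
question, sql = pairs[0]
print("Question:", question)
print("SQL:", sql)
```
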
3.1.1. Diversity and complexity of the dataset

The MIMICSQL dataset covers a wide range of patient information categories, including demographics, laboratory tests, diagnoses, procedures, and prescriptions. These categories are embedded in a schema structure that outlines the database's tables, columns, and interrelationships, serving as a crucial guide for the models to comprehend the database structure and accurately formulate SQL queries. Table 1 illustrates the DEMOGRAPHIC table, while Table 2 presents the PROCEDURES table from the MIMICSQL dataset. This diversity reflects the complexity and multidimensionality of healthcare-related queries,


