
Artificial Intelligence in Health                                 Schema-less text2sql conversion with LLMs



enabling users to formulate queries in natural language, thereby lowering the barriers to data access and analysis.

In the past decade, the field of natural language processing (NLP), especially through the development of LLMs, has seen remarkable progress, substantially enhancing text-to-SQL systems' performance.1,2 Models such as T5, LLaMA, GPT-3, GPT-3.5, and GPT-4 have been pivotal in advancing natural language understanding and generation, displaying a profound ability to process and produce human-like text. Despite these advancements, adapting these versatile models for specific applications, such as generating SQL queries for structured data, remains a significant challenge.

In this research, we aim to tackle the dual challenges of simplifying input prompts and elevating accuracy in the generation of SQL queries, with a specific focus on the intricate landscape of the medical domain. Given the critical importance of precision in data retrieval within healthcare contexts, our primary goal is to fine-tune Flan-T5-based models using text-to-SQL query pairs meticulously tailored for the medical MIMICSQL dataset.3 The decision to utilize a medical dataset in our research is driven by the distinctive challenges and precision requirements inherent in healthcare data retrieval. The choice of the MIMICSQL dataset, derived from the widely used MIMIC-III database, provides a realistic and clinically relevant context, allowing us to address the complexities of real-world medical scenarios. Focusing on the medical domain enables us to tailor our approach to the unique intricacies of healthcare data, contributing directly to advancements in medical data management. By enhancing the accuracy of SQL query generation in this specific context, our research seeks to deliver a meaningful impact on the efficiency of data retrieval in medical databases, benefiting healthcare professionals, researchers, and decision-makers.

To guide our investigation effectively, we pose the following research questions:
(i) How can we optimize the formulation of input prompts to simplify the querying process while maintaining the necessary specificity required for medical data retrieval?
(ii) What adjustments and enhancements can be made to Flan-T5-based models to improve their accuracy in generating SQL queries tailored to the nuances of the medical MIMICSQL dataset?
(iii) How do schema-less questions contribute to streamlining input prompt complexity, and what impact does this simplification have on the overall performance of SQL query generation in the medical domain?

As we delve into these research questions, our methodology strategically leverages schema-less questions, with a deliberate focus on mitigating the challenges posed by complex and lengthy input prompts. While acknowledging that this approach may limit generalization across diverse database schemas, we anticipate that the pronounced enhancement in overall performance will justify this deliberate trade-off.

The organization of this paper is as follows: Section 2 presents a thorough review of the existing literature in the text-to-SQL field. Section 3 describes our methodology, including details about the MIMICSQL dataset, preprocessing steps, and the fine-tuning process. Section 4 discusses the experimental setup, covering evaluation metrics, comparison methods, experimental results, and their analysis. The final section concludes the paper, summarizing our contributions and highlighting the significance of applying LLMs to the text-to-SQL task, with a special emphasis on schema-less querying.

2. Related works

The task of text-to-SQL is to convert natural language utterances into SQL queries. This field has attracted researchers in the NLP and the database communities for decades.4-9 The methodologies currently in use to handle this task can be broadly divided into three categories: rule-based methods, fine-tuning methods, and in-context learning (ICL) methods. Rule-based approaches, as highlighted in other studies,7,10 utilize predefined templates to generate SQL queries. These methods show proficiency in certain scenarios but are limited by the necessity for manual rule formulation, which restricts their versatility across diverse domains.

Addressing the limitations of rule-based methods, recent research has ventured into more flexible approaches. The utilization of bidirectional long short-term memory and convolutional networks11 in Seq2Seq models has enhanced adaptability and effectiveness, though integrating structural database information remains a persistent challenge. Graph neural networks have emerged as a solution to this, with approaches that treat the database schema as a graph, as seen in other works.12,13 Furthermore, the introduction of the MIMICSQL dataset and the model-based Translate-Edit Model for Question-to-SQL (TREQS) system marked a significant advancement in text-to-SQL, particularly in the medical domain. The TREQS model sets a robust baseline for subsequent evaluations. In our study, we use the MIMICSQL dataset and the TREQS model as benchmarks to evaluate and compare the effectiveness of our proposed method. In addition, fine-tuning pretrained language models like T5 has demonstrated improved


            Volume 1 Issue 2 (2024)                         97                               doi: 10.36922/aih.2661