
Artificial Intelligence in Health                                 Schema-less text2sql conversion with LLMs



as the SQL queries generated from these questions often involve multiple tables and columns.

3.1.2. Size and partitioning

The MIMICSQL dataset comprises approximately 10,000 examples, partitioned into training and development (train and dev) sets, which together constitute 80% (8000 question-SQL pairs), and a test set accounting for the remaining 20% (2000 question-SQL pairs). This division supports both the training and the evaluation phases. Statistical distributions from the MIMICSQL dataset are illustrated in Figure 1, and an illustrative example from the dataset is shown in Figure 2. Specifically, Figure 1A depicts the distribution of natural language (NL) question lengths, while Figure 1B shows the distribution of SQL query lengths. These length distributions reveal the dataset’s inherent characteristics and aid in the design of models capable of handling diverse language structures. In addition, they provide a basis for evaluating model performance by highlighting potential challenges associated with varying lengths. Understanding these distributions is crucial both for selecting the model training hyperparameters, especially the maximum input and output lengths, and for assessing the generalizability of the developed models to real-world applications.

3.1.3. Challenges addressed

One of the key challenges addressed by the MIMICSQL dataset is the prevalence of abbreviations and potential typos in healthcare-related questions. This poses a significant obstacle to accurately generating the corresponding SQL queries, as the keywords provided in the questions may not precisely match those used in the EMR data. Consequently, the dataset presents a real-world scenario that requires models to handle the nuances and complexities of healthcare-related queries.

Table 1. Example of the DEMOGRAPHIC table from the MIMICSQL database

SUBJECT_ID    HADM_ID    Gender    ADMISSION_TYPE    …
990           184231     F         EMERGENCY         …
17772         122127     M         NEWBORN           …
…             …          …         …                 …
66411         178264     F         EMERGENCY         …

Table 2. Example of the PROCEDURES table from the MIMICSQL database

SUBJECT_ID    HADM_ID    SHORT_TITLE             …
9258          183354     Procedure-one vessel    …
28588         141664     Insert endot. tube      …
…             …          …                       …
66411         178264     Abdomen artery inc.     …

3.2. Problem formulation

The SQL query generation task can be formulated as follows. Let $D = \{(Q_i, \mathrm{SQL}(Q_i))\}$ for $i = 1, 2, \ldots, N$ denote the dataset, where $Q_i$ is the i-th natural language question and $\mathrm{SQL}(Q_i)$ is the corresponding ground-truth SQL query. The objective is to learn a mapping function $F(Q; \theta)$, parameterized by $\theta$, using the LLMs Flan-T5 Base and Flan-T5 Large:

$$\mathrm{SQL}(Q) = F(Q; \theta) \tag{I}$$

The training process involves minimizing the cross-entropy loss between the predicted SQL queries and the ground-truth queries:

$$\mathrm{Loss}(\theta) = -\sum_{i=1}^{N} \log P(\mathrm{SQL}(Q_i) \mid Q_i; \theta) \tag{II}$$

where $P(\mathrm{SQL}(Q_i) \mid Q_i; \theta)$ denotes the probability of the model predicting the correct SQL query for the i-th natural language question ($Q_i$), given the model parameters $\theta$. Our schema-less approach entails using only the input question $Q$ as context during inference, without explicit database schema information.
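
To make the loss in Equation (II) concrete, the following minimal sketch evaluates it for two toy examples. It assumes that the probability of a whole SQL query factorizes into per-token probabilities, as is usual for autoregressive sequence-to-sequence models; that factorization, the function names, and the probability values are illustrative assumptions rather than details taken from the text.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of one ground-truth SQL query.

    Assumes (hypothetically) that the query probability is the product of
    the probabilities the model assigns to each ground-truth token.
    """
    return sum(math.log(p) for p in token_probs)


def dataset_loss(batch_token_probs):
    """Equation (II): negative sum over examples of log P(SQL(Q_i) | Q_i; theta)."""
    return -sum(sequence_log_prob(tp) for tp in batch_token_probs)


# Made-up probabilities assigned to the ground-truth tokens of two queries.
example_probs = [
    [0.9, 0.8, 0.95],  # tokens of SQL(Q_1)
    [0.6, 0.7],        # tokens of SQL(Q_2)
]

loss = dataset_loss(example_probs)
print(f"Loss = {loss:.4f}")
```

The loss is zero only when every ground-truth token receives probability 1, and it grows as the model assigns less probability to the reference query. Training libraries often report a per-token mean rather than this sum, which rescales the value but leaves the minimizer unchanged for a fixed dataset.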
Figure 1. Distribution of input and output sequence lengths in the MIMICSQL dataset. (A) Distribution of question lengths (the input). (B) Distribution of SQL query lengths (the output).
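
One way length distributions such as those in Figure 1 can inform the maximum input and output lengths is to pick the smallest length that covers a chosen fraction of the examples. The sketch below is a hypothetical illustration: it counts whitespace-separated tokens as a stand-in for the model tokenizer, and the helper names, example strings, and coverage value are assumptions, not part of the described method.

```python
import math


def token_lengths(texts):
    """Whitespace token counts; a simple stand-in for the model tokenizer (assumption)."""
    return [len(t.split()) for t in texts]


def length_covering(lengths, coverage=0.95):
    """Smallest length L such that at least `coverage` of the sequences have length <= L."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(lengths)
    k = math.ceil(coverage * len(ordered))  # number of sequences that must fit
    return ordered[max(k, 1) - 1]


# Made-up question/SQL strings, used only to exercise the helpers.
questions = [
    "how many patients are female",
    "count emergency admissions for subject 990",
]
queries = [
    "SELECT COUNT ( * ) FROM DEMOGRAPHIC",
    "SELECT COUNT ( * ) FROM DEMOGRAPHIC WHERE SUBJECT_ID = 990",
]

max_input_len = length_covering(token_lengths(questions), coverage=1.0)
max_output_len = length_covering(token_lengths(queries), coverage=1.0)
print(max_input_len, max_output_len)
```

With a real subword tokenizer the counts would be larger than whitespace counts, so the same procedure should be run on tokenized lengths before fixing any truncation limit.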

            Volume 1 Issue 2 (2024)                         99                               doi: 10.36922/aih.2661