
            Figure 2. Illustrative example demonstrating how the MIMICSQL dataset utilizes the DEMOGRAPHICS and PROCEDURES tables to construct a
            response to a given question. This example employs color coding to distinctly indicate the correlations between components of the source question, the
            corresponding SQL query, and the SQL query template. Such examples highlight the dataset’s structure and the complexity of mapping natural language
            questions to SQL queries.


3.3. Preprocessing
To effectively utilize the MIMICSQL dataset within the constraints of LLMs like Flan-T5, specific preprocessing steps are essential. These steps are designed to address the complexities of SQL syntax and the particular decoding capabilities of such models. A key consideration in this process is the model's vocabulary, as an excessive number of special tokens can detrimentally affect performance. The following outlines the detailed preprocessing steps undertaken:

3.3.1. Enclosing JSON objects in an array
Individual JSON objects in the MIMICSQL dataset were enclosed within an array to ensure a consistent JSON array structure. This step was essential for data manipulation and loading during subsequent processing.
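
A minimal sketch of this step is shown below; it assumes the raw split stores one JSON object per line, and the file names are illustrative rather than the actual MIMICSQL file names:

import json

# File names are illustrative; the raw split is assumed to hold one JSON object per line.
with open("mimicsql_train.json", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Write all objects back as a single JSON array for consistent downstream loading.
with open("mimicsql_train_array.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)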

3.3.2. Replacing SQL characters
Replacing symbols with their corresponding words is a common practice in NLP text preprocessing and offers several advantages. One reason for this practice is to address the absence of certain special characters, such as “<”, “<=”, and “<>”, in the vocabulary of models like Flan-T5. Moreover, this replacement enhances model understanding, as words are typically more interpretable and easier for the model to learn. It can lead to improved generalization, as models have an easier time working with words, which are part of natural language, compared to arbitrary symbols. In addition, using words reduces ambiguity, as symbols can be context-dependent and unclear.
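
The sketch below illustrates this substitution; the exact word mappings used in our pipeline are not reproduced here, so the ones shown are placeholders:

# Illustrative symbol-to-word mapping; the actual replacements may differ.
SYMBOL_WORDS = {
    "<=": " less equal ",
    ">=": " greater equal ",
    "<>": " not equal ",
    "<": " less than ",
    ">": " greater than ",
}

def replace_sql_symbols(sql: str) -> str:
    # Replace multi-character operators first so "<=" is not split into "<" and "=".
    for symbol, word in SYMBOL_WORDS.items():
        sql = sql.replace(symbol, word)
    # Normalize any extra whitespace introduced by the replacements.
    return " ".join(sql.split())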

3.3.3. Converting JSON to CSV
After the preprocessing steps, the JSON data were transformed into CSV format to facilitate compatibility with data analysis and modeling libraries. The CSV format, with “text” and “sql” columns, allowed seamless data integration into the training and evaluation processes.
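
A minimal sketch of this conversion is given below, assuming the array produced in Section 3.3.1; the record keys “question” and “sql” are assumptions about the MIMICSQL layout:

import json
import pandas as pd

# Key names ("question", "sql") are assumptions about the MIMICSQL record layout.
with open("mimicsql_train_array.json", "r", encoding="utf-8") as f:
    records = json.load(f)

rows = [{"text": r["question"], "sql": r["sql"]} for r in records]
pd.DataFrame(rows).to_csv("mimicsql_train.csv", index=False)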

3.3.4. Data cleaning
The majority of the data cleaning was conducted beforehand, on the MIMIC III database while building the MIMICSQL dataset; it involved correcting errors in patient demographics, standardizing the format of clinical notes, and filtering out irrelevant data. On the MIMICSQL dataset itself, we only removed some duplicated statements and corrected some typos in the SQL queries.

3.3.5. Tokenization and text prefixing
In the preprocessing phase, we employed the Flan-T5 tokenizer to tokenize the input and output texts. Flan-T5 utilizes a sub-word tokenizer to break down the input texts into smaller units, capturing both word-level and sub-word information. It is based on SentencePiece, a popular unsupervised text tokenizer and detokenizer that employs a segmentation algorithm to divide the input texts into sub-word units, allowing the model to handle a wide range of vocabulary and linguistic nuances. By leveraging Flan-T5 Large's tokenization approach, we aim to capture the contextual information present in both complete words and sub-word units, enhancing the model's ability to comprehend and generate meaningful sequences during the subsequent stages of our text-to-SQL task. Moreover, we added the prefix “transform:” to each natural language question. The prefix is specific to the T5-based model, allowing it to recognize the text as a task to be transformed into SQL queries. For the target, the padding token IDs are set to −100, an adjustment designed to disregard padding tokens during loss calculation.

Table 3 shows a running example for the preprocessing of two entries from the MIMICSQL dataset.

4. Experiments and results
In this section, we describe the experimental setup, infrastructure details, and evaluation metrics for the experiment, as well as a comparison with other models.

