Figure 2. Illustrative example demonstrating how the MIMICSQL dataset utilizes the DEMOGRAPHICS and PROCEDURES tables to construct a
response to a given question. This example employs color coding to distinctly indicate the correlations between components of the source question, the
corresponding SQL query, and the SQL query template. Such examples highlight the dataset’s structure and the complexity of mapping natural language
questions to SQL queries.
3.3. Preprocessing
To effectively utilize the MIMICSQL dataset within the constraints of LLMs like Flan-T5, specific preprocessing steps are essential. These steps are designed to address the complexities of SQL syntax and the particular decoding capabilities of such models. A key consideration in this process is the model’s vocabulary, as an excessive number of special tokens can detrimentally affect performance. The following outlines the detailed preprocessing steps undertaken:

3.3.1. Enclosing JSON objects in an array
Individual JSON objects in the MIMICSQL dataset were enclosed within an array to ensure a consistent JSON array structure. This step was essential for data manipulation and loading during subsequent processing.
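A minimal Python sketch of this step is shown below. It assumes the raw file stores one JSON object per line; the file names are illustrative rather than taken from our pipeline.

```python
import json

# Minimal sketch: wrap individual JSON objects (assumed here to be stored one
# per line) into a single JSON array so the file can be loaded in one call.
# File names are hypothetical.
with open("mimicsql_raw.json", "r", encoding="utf-8") as infile:
    records = [json.loads(line) for line in infile if line.strip()]

with open("mimicsql_array.json", "w", encoding="utf-8") as outfile:
    json.dump(records, outfile, indent=2)
```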
3.3.2. Replacing SQL characters
Replacing symbols with their corresponding words during text preprocessing is a common practice in NLP and offers several advantages. One reason for this practice is to address the absence of certain special characters, such as “<,” “<=,” and “<>,” from the vocabulary of models like Flan-T5. Moreover, this replacement enhances model understanding, as words are typically more interpretable and easier for the model to learn. It can also lead to improved generalization, since models handle words, which are part of natural language, more easily than arbitrary symbols. In addition, using words reduces ambiguity, as symbols can be context-dependent and unclear.
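The substitution can be implemented as a simple string mapping, as sketched below. The word forms chosen for each operator are assumptions for illustration; only the need to remove symbols such as “<,” “<=,” and “<>” follows from the discussion above.

```python
import re

# Assumed symbol-to-word mapping; the exact word forms used in our pipeline
# are not reproduced here. Longer operators come first so that "<=" is not
# split into "<" followed by "=".
SYMBOL_WORDS = {
    "<=": " less than or equal to ",
    ">=": " greater than or equal to ",
    "<>": " not equal to ",
    "<": " less than ",
    ">": " greater than ",
}

def replace_sql_symbols(sql: str) -> str:
    """Replace operators missing from the Flan-T5 vocabulary with words."""
    for symbol, word in SYMBOL_WORDS.items():
        sql = sql.replace(symbol, word)
    # Collapse any double spaces introduced by the replacements.
    return re.sub(r"\s+", " ", sql).strip()

print(replace_sql_symbols("SELECT name FROM demographic WHERE age <= 80"))
```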
3.3.3. Converting JSON to CSV
After the preprocessing steps, the JSON data were transformed into CSV format to facilitate compatibility with data analysis and modeling libraries. The CSV format with “text” and “sql” columns allowed seamless data integration into the training and evaluation processes.
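A possible implementation of this conversion with pandas is sketched below. The source field name (question_refine) and the file paths are assumptions; only the target “text” and “sql” column names follow from the text above.

```python
import pandas as pd

# Sketch: flatten the JSON array into a two-column CSV with "text" and "sql"
# columns. The source field name "question_refine" and the file paths are
# assumptions, not taken verbatim from the paper.
df = pd.read_json("mimicsql_array.json")
df = df.rename(columns={"question_refine": "text"})[["text", "sql"]]
df.to_csv("mimicsql.csv", index=False)
```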
3.3.4. Data cleaning
The majority of the data cleaning was performed beforehand on the MIMIC-III database while building the MIMICSQL dataset, and involved correcting errors in patient demographics, standardizing the format of clinical notes, and filtering out irrelevant data. On the MIMICSQL dataset itself, we only removed some duplicated statements and corrected some typos in the SQL queries.
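The duplicate-removal part of this cleaning can be expressed as a short pandas sketch, shown below with hypothetical file names; the manual typo corrections are not reproduced.

```python
import pandas as pd

# Sketch of the light cleaning applied to MIMICSQL itself: dropping exact
# duplicate (text, sql) pairs. File names are hypothetical.
df = pd.read_csv("mimicsql.csv")
df = df.drop_duplicates(subset=["text", "sql"]).reset_index(drop=True)
df.to_csv("mimicsql_clean.csv", index=False)
```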
3.3.5. Tokenization and text prefixing
In the preprocessing phase, we employed the Flan-T5 tokenizer to tokenize the input and output texts. Flan-T5 utilizes a sub-word tokenizer to break the input texts into smaller units, capturing both word-level and sub-word information. It is based on SentencePiece, a popular unsupervised text tokenizer and detokenizer that employs a segmentation algorithm to divide the input texts into sub-word units, allowing the model to handle a wide range of vocabulary and linguistic nuances. By leveraging Flan-T5 Large’s tokenization approach, we aim to capture the contextual information present in both complete words and sub-word units, enhancing the model’s ability to comprehend and generate meaningful sequences during the subsequent stages of our text-to-SQL task. Moreover, we added the prefix “transform:” to each natural language question. This prefix is specific to the T5-based model, allowing it to recognize the text as a task to be transformed into a SQL query. For the target, the padding token ID is set to −100, an adjustment designed to disregard padding tokens during loss calculation.
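A sketch of this tokenization and prefixing step is given below, assuming the Hugging Face checkpoint google/flan-t5-large and illustrative maximum sequence lengths; the “transform:” prefix and the −100 label for padding positions follow the description above.

```python
from transformers import AutoTokenizer

# Sketch of the tokenization and prefixing step. The checkpoint name and the
# maximum sequence lengths are assumptions; the "transform:" prefix and the
# -100 label for padding positions follow Section 3.3.5.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

def preprocess(example: dict) -> dict:
    # Prefix the question so the T5-style model treats it as a transformation task.
    model_inputs = tokenizer(
        "transform: " + example["text"],
        max_length=512, truncation=True, padding="max_length",
    )
    labels = tokenizer(
        example["sql"], max_length=256, truncation=True, padding="max_length",
    )["input_ids"]
    # Replace padding token IDs in the target with -100 so that they are
    # ignored when the cross-entropy loss is computed.
    model_inputs["labels"] = [
        tok if tok != tokenizer.pad_token_id else -100 for tok in labels
    ]
    return model_inputs

# Illustrative entry (not taken from the dataset).
features = preprocess({
    "text": "count the number of patients",
    "sql": "SELECT COUNT ( DISTINCT demographic.subject_id ) FROM demographic",
})
```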
Table 3 shows a running example of the preprocessing of two entries from the MIMICSQL dataset.

4. Experiments and results
In this section, we describe the experimental setup, infrastructure details, and evaluation metrics for the experiments, as well as a comparison with other models.

