Table 3. A running example for the preprocessing steps

Example 1
Original text:
  Input: Get the number of patients who died in or before 2132 and were less than 72 years of age.
  Output: SELECT COUNT (DISTINCT DEMOGRAPHIC.'SUBJECT_ID') FROM DEMOGRAPHIC WHERE DEMOGRAPHIC.'AGE' < '72' AND DEMOGRAPHIC.'DOD_YEAR' <= '2132.0'
Preprocessed text:
  Input: transform: get the number of patients who died in or before 2132 and were less than 72 years of age.
  Output: select count (distinct demographic.'subject_id') from demographic where demographic.'age' less than '72' and demographic.'dod_year' less than or equal to '2132.0'
Tokenized text:
  Input: ['▁transform', ':', '▁Get', '▁the', '▁number', '▁of', '▁patients', '▁who', '▁died', '▁in', '▁or', '▁before', '▁21', '32', '▁and', '▁were', '▁less', '▁than', '▁72', '▁years', '▁of', '▁age', '.']
  Output: ['▁select', '▁count', '▁(', '▁distinct', '▁demographic', '.', "'", 'sub', 'ject', '_', 'i', 'd', "'", '▁', ')', '▁from', '▁demographic', '▁where', '▁demographic', '.', "'", 'age', "'", '▁less', '▁than', "'", '72', "'", '▁and', '▁demographic', '.', "'", 'd', 'o', 'd', '_', 'year', "'", '▁less', '▁than', '▁or', '▁equal', '▁to', '▁', "'", '2', '13', '2.0', "'"]

Example 2
Original text:
  Input: calculate the minimum days for which patients aged 20 years or older were hospitalized.
  Output: SELECT MIN (DEMOGRAPHIC.'DAYS_STAY') FROM DEMOGRAPHIC WHERE DEMOGRAPHIC.'AGE' >= '20'
Preprocessed text:
  Input: transform: calculate the minimum days for which patients aged 20 years or older were hospitalized.
  Output: select min (demographic.'days_stay') from demographic where demographic.'age' greater than or equal to '20'
Tokenized text:
  Input: ['▁transform', ':', '▁calculate', '▁the', '▁minimum', '▁days', '▁for', '▁which', '▁patients', '▁aged', '▁20', '▁years', '▁or', '▁older', '▁were', '▁hospital', 'ized', '.']
  Output: ['▁select', '▁min', '▁(', '▁demographic', '.', "'", 'day', 's', '_', 'stay', "'", '▁', ')', '▁from', '▁demographic', '▁where', '▁demographic', '.', "'", 'age', "'", '▁greater', '▁than', '▁or', '▁equal', '▁to', '▁', "'", '20', "'"]
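To make the preprocessing in Table 3 concrete, the sketch below reproduces the main steps on the second example, assuming the Hugging Face T5 tokenizer; the exact normalization rules of our pipeline may differ slightly, so this is an illustration rather than the reference implementation.

```python
# Minimal sketch of the Table 3 preprocessing (illustrative, not the exact
# pipeline): prefix the question with "transform:", lowercase the SQL, and
# verbalize comparison operators so they survive subword tokenization.
from transformers import T5Tokenizer

def preprocess(question: str, sql: str) -> tuple[str, str]:
    source = "transform: " + question.strip()
    target = sql.lower()
    # Order matters: handle "<=" and ">=" before "<" and ">".
    for op, words in [("<=", " less than or equal to "),
                      (">=", " greater than or equal to "),
                      ("<", " less than "),
                      (">", " greater than ")]:
        target = target.replace(op, words)
    return source, " ".join(target.split())  # collapse extra whitespace

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
src, tgt = preprocess(
    "calculate the minimum days for which patients aged 20 years or older were hospitalized.",
    "SELECT MIN (DEMOGRAPHIC.'DAYS_STAY') FROM DEMOGRAPHIC WHERE DEMOGRAPHIC.'AGE' >= '20'",
)
print(tokenizer.tokenize(src))  # ['▁transform', ':', '▁calculate', ...]
print(tokenizer.tokenize(tgt))  # ['▁select', '▁min', ...]
```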
4.1. Experimental setup

A key objective of this study is to evaluate the practicality of the training and inference processes. For this purpose, we employed a standalone machine equipped with a single Nvidia V100 GPU (16 GB vRAM) and 32 GB of system memory.

In our implementation, we utilized both the Base and Large versions of Flan-T5, which are based on the original T5 encoder-decoder architecture augmented with an instruction-finetuning mechanism. This architecture consists of multiple layers of transformer blocks, including self-attention mechanisms and feed-forward neural networks. These transformer blocks enable the model to capture long-range dependencies and contextual information from input sequences.

The input and output sequence lengths were standardized to 1024 tokens. Sequences exceeding this length were truncated, while shorter sequences were padded using a pad token. This sequence length configuration enabled a maximum training batch size of two on our GPU setup. Various learning rates and optimizers were tested, ultimately leading to the selection of the Adam optimizer with a learning rate of 5e-5.

The fine-tuning phase for each model spanned five epochs, a decision based on empirical observations of convergence and generalization in preliminary trials. This epoch count offered an optimal balance between model performance and training duration, ensuring that the models adequately learned the patterns of the MIMICSQL dataset without overfitting. In terms of time, completing five epochs of training took approximately 4 h for the Flan-T5 Base version and 7 h for the Flan-T5 Large version. Table 4 details the tuning scenario.

While evaluating the other approaches, significant attention was given to the construction of prompts for LLaMA-2-7B, GPT-3.5-Turbo, GPT-4, and DeFog-SQLCoder so that they could effectively generate SQL queries. For these models, prompts were meticulously designed to include schema information, facilitating the generation of accurate SQL queries. These prompts are essential for guiding the models through the task, leveraging their inherent language-understanding capabilities. Detailed examples of these prompts are provided in Figure 3. Notably, our approach with the Flan-T5 models is distinct from this conventional method and removes the need for any schema information in the prompts. In this approach, the Flan-T5 models are fine-tuned so that the questions are the only inputs, and the database schema is captured automatically in a more efficient fashion. This distinction underscores the uniqueness and efficiency of our methodology. It is important to mention that the TREQS models, not being categorized as LLMs, did not require such prompt-based approaches but used the schema as input together with the question, further differentiating our method from traditional practices.

4.2. Evaluation metric

The most commonly used evaluation metrics for the text-to-SQL task are logical form accuracy (LFA, or exact matching) and execution accuracy. Since the MIMICSQL community has not shared databases for computing the execution accuracy metric, the primary metric we used for evaluating the models' performance in this study is LFA. It measures the percentage of exact string matches between the generated SQL queries and the ground-truth SQL queries; that is, it is quantified as the percentage of instances where the model's predicted SQL query is identical to the reference query.
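As a point of reference for the setup in Section 4.1, the following sketch shows how such a fine-tuning run could be configured with the Hugging Face Seq2SeqTrainer under the reported hyperparameters (1024-token sequences, batch size 2, learning rate 5e-5, five epochs); the dataset column names and output directory are illustrative assumptions, not the training script used in this study.

```python
# Illustrative Flan-T5 fine-tuning configuration mirroring the reported
# setup; "source"/"target" are assumed column names for the preprocessed
# question and SQL query (see Table 3).
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"  # or "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def tokenize_pair(batch):
    # Truncate to 1024 tokens and pad shorter sequences with the pad token.
    enc = tokenizer(batch["source"], max_length=1024,
                    truncation=True, padding="max_length")
    labels = tokenizer(batch["target"], max_length=1024,
                       truncation=True, padding="max_length")
    enc["labels"] = labels["input_ids"]
    return enc

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-mimicsql",   # hypothetical output path
    per_device_train_batch_size=2,   # maximum that fits a 16 GB V100
    learning_rate=5e-5,
    num_train_epochs=5,
)

# With a dataset mapped through tokenize_pair (here called tokenized_train):
# trainer = Seq2SeqTrainer(
#     model=model, args=args, train_dataset=tokenized_train,
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
# trainer.train()
```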
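For comparison with the schema-less Flan-T5 inputs, a schema-aware prompt for the LLM baselines could look like the sketch below; both the instruction wording and the schema excerpt are hypothetical and merely stand in for the actual prompts shown in Figure 3.

```python
# Hypothetical schema-aware prompt for the LLM baselines (LLaMA-2-7B,
# GPT-3.5-Turbo, GPT-4, DeFog-SQLCoder). The column list is a made-up
# excerpt; the real prompts are given in Figure 3.
SCHEMA_EXCERPT = "DEMOGRAPHIC(SUBJECT_ID, NAME, AGE, DOD_YEAR, DAYS_STAY, ...)"

def build_prompt(question: str) -> str:
    return (
        "Given the following database schema:\n"
        f"{SCHEMA_EXCERPT}\n\n"
        "Write a single SQL query that answers the question below. "
        "Return only the SQL query.\n"
        f"Question: {question}\nSQL:"
    )

print(build_prompt("calculate the minimum days for which patients aged 20 "
                   "years or older were hospitalized."))
```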
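Because LFA is simply the fraction of generated queries that match the reference string exactly, it can be computed as in the short sketch below; the whitespace and case normalization shown here is an assumption about how minor formatting differences are treated.

```python
# Logical form accuracy (exact matching): the percentage of generated SQL
# queries that are string-identical to the ground truth after a light,
# assumed normalization (lowercasing and whitespace collapsing).
def logical_form_accuracy(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references) and references
    normalize = lambda q: " ".join(q.strip().lower().split())
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

preds = ["select min (demographic.'days_stay') from demographic where demographic.'age' >= '20'"]
refs  = ["select min (demographic.'days_stay') from demographic where demographic.'age' >= '20'"]
print(f"LFA = {logical_form_accuracy(preds, refs):.1f}%")  # LFA = 100.0%
```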

