4.3. Results and discussion

The evaluation of various text-to-SQL models on the MIMICSQL test set has provided significant insights. The baseline TREQS model recorded an LFA of 0.48, which increased marginally to 0.55 with the incorporation of a recovery technique (TREQS + Recover). The current state-of-the-art model, Defog-SQLCoder, achieved an LFA of 0.65. In comparison, the LLMs GPT-3.5-Turbo and GPT-4 demonstrated robust performance with LFA scores of 0.60 and 0.70, respectively, highlighting their applicability. In addition, the LLaMA-2-7B model, which was fine-tuned for text-to-SQL tasks, attained an LFA of 0.60. Remarkably, our custom fine-tuned model, Flan-T5 Large, surpassed all of these models with an LFA of 0.85.

Figure 4 presents a sample natural language query, the ground truth SQL query that would accurately answer it, and the SQL queries generated by the LLMs used in our experiments, namely LLaMA-2-7B, GPT-3.5-Turbo, GPT-4, and Defog-SQLCoder, along with our Flan-T5 models. This comparison highlights the differences in the query generation capabilities of each model, offering a tangible demonstration of their respective performances in the text-to-SQL context.

This outcome indicates that while existing models such as GPT-3.5-Turbo (20B parameters), LLaMA-2-7B (7B parameters), and Defog-SQLCoder (15B parameters) show commendable proficiency, our schema-less text-to-SQL approach with Flan-T5 Large, which has only 780M parameters, notably outperforms them all. This demonstrates not only superior performance but also remarkable efficiency, offering transformative potential in both domain-specific and broader applications. The detailed results are tabulated in Table 5.

The results of our comprehensive evaluation shed light on the text-to-SQL domain, underscoring the significance of large language models (LLMs) and the promising potential of schema-less approaches in healthcare. It is crucial to note that two of the LLMs under scrutiny, LLaMA-2-7B and Defog-SQLCoder, were fine-tuned on the text-to-SQL task using datasets such as MIMICSQL, thereby directly incorporating knowledge pertinent to this domain. The GPT models (GPT-3.5-Turbo and GPT-4), on the other hand, are renowned for their versatility across a wide range of NLP tasks, including text-to-SQL, owing to their extensive pre-training on diverse corpora. While these models were not specifically fine-tuned on the MIMICSQL dataset, their broad exposure during pre-training to a wide array of textual and structured data may have contributed to their performance on the MIMICSQL test set. This factor is important to consider when interpreting the comparative performance of these models.
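For readers unfamiliar with the metric, LFA (logic form accuracy) in this line of work is conventionally the fraction of generated queries whose logical form exactly matches the ground truth. The sketch below illustrates such a metric under a simple assumption of case- and whitespace-insensitive string comparison; the function names and normalization rules are illustrative, not the exact ones used in our evaluation pipeline.

import re

def normalize_sql(query: str) -> str:
    # Illustrative normalization: lowercase, collapse whitespace, and
    # strip a trailing semicolon so that pure formatting differences do
    # not count as logic-form mismatches.
    query = query.strip().rstrip(";").lower()
    return re.sub(r"\s+", " ", query)

def logic_form_accuracy(predictions, references):
    # Fraction of generated queries whose normalized form exactly
    # matches the normalized ground truth query.
    matches = sum(normalize_sql(p) == normalize_sql(r)
                  for p, r in zip(predictions, references))
    return matches / len(references)

# Example: one exact logic-form match out of two predictions gives 0.5.
preds = ["SELECT COUNT(*) FROM demographic WHERE gender = 'F';",
         "SELECT name FROM demographic"]
refs = ["select count(*) from demographic where gender = 'f'",
        "select subject_id from demographic"]
print(logic_form_accuracy(preds, refs))  # 0.5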
Figure 4. Sample SQL query generation. This figure illustrates a sample natural language query alongside the corresponding ground truth SQL query and the SQL queries generated by the evaluated LLMs (LLaMA-2-7B, GPT-3.5-Turbo, GPT-4, and Defog-SQLCoder) and our Flan-T5 models. In addition, an augmented version of the ground truth query is presented, serving as an example of how we enriched the training data during the fine-tuning of our Flan-T5 models. It is important to note that this augmentation was used exclusively for training; no data in the test set were altered or augmented in any manner.
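To make the generation setup concrete, the sketch below shows how a fine-tuned Flan-T5 checkpoint can translate a natural language question into SQL with the Hugging Face transformers library. The model identifier, prompt prefix, and decoding parameters are illustrative assumptions: google/flan-t5-large is the 780M-parameter base model discussed above, not our fine-tuned checkpoint, and the exact prompt template used in our experiments is not reproduced here.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumption: "google/flan-t5-large" stands in for a fine-tuned
# schema-less text-to-SQL checkpoint.
MODEL_NAME = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def question_to_sql(question: str) -> str:
    # Schema-less prompting: only the natural language question is
    # supplied; no table or column definitions accompany it. The
    # prefix below is an illustrative choice of prompt template.
    prompt = "translate to SQL: " + question
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(question_to_sql("how many female patients were diagnosed with hypertension?"))

Beam search is used here only to stabilize decoding of short queries; greedy decoding is an equally reasonable choice in this setting.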