Artificial Intelligence in Health Schema-less text2sql conversion with LLMs
enabling users to formulate queries in natural language, thereby lowering the barriers to data access and analysis.

In the past decade, the field of natural language processing (NLP), especially through the development of LLMs, has seen remarkable progress, substantially enhancing the performance of text-to-SQL systems.1,2 Models such as T5, LLaMA, GPT-3, GPT-3.5, and GPT-4 have been pivotal in advancing natural language understanding and generation, displaying a profound ability to process and produce human-like text. Despite these advancements, adapting these versatile models for specific applications, such as generating SQL queries for structured data, remains a significant challenge.

In this research, we aim to tackle the dual challenges of simplifying input prompts and improving accuracy in the generation of SQL queries, with a specific focus on the medical domain. Given the critical importance of precision in data retrieval within healthcare contexts, our primary goal is to fine-tune Flan-T5-based models using text-to-SQL query pairs tailored for the medical MIMICSQL dataset.3 The decision to utilize a medical dataset in our research is driven by the distinctive challenges and precision requirements inherent in healthcare data retrieval. The choice of the MIMICSQL dataset, derived from the widely used MIMIC-III database, provides a realistic and clinically relevant context, allowing us to address the complexities of real-world medical scenarios. Focusing on the medical domain enables us to tailor our approach to the unique intricacies of healthcare data, contributing directly to advancements in medical data management. By enhancing the accuracy of SQL query generation in this specific context, our research seeks to deliver a meaningful impact on the efficiency of data retrieval in medical databases, benefiting healthcare professionals, researchers, and decision-makers.

To guide our investigation effectively, we pose the following research questions:
(i) How can we optimize the formulation of input prompts to simplify the querying process while maintaining the necessary specificity required for medical data retrieval?
(ii) What adjustments and enhancements can be made to Flan-T5-based models to improve their accuracy in generating SQL queries tailored to the nuances of the medical MIMICSQL dataset?
(iii) How do schema-less questions contribute to streamlining input prompt complexity, and what impact does this simplification have on the overall performance of SQL query generation in the medical domain?

As we delve into these research questions, our methodology strategically leverages schema-less questions, with a deliberate focus on mitigating the challenges posed by complex and lengthy input prompts. While acknowledging that this approach may limit generalization across diverse database schemas, we anticipate that the pronounced enhancement in overall performance will justify this deliberate trade-off.

The organization of this paper is as follows: Section 2 presents a thorough review of the existing literature in the text-to-SQL field. Section 3 describes our methodology, including details about the MIMICSQL dataset, preprocessing steps, and the fine-tuning process. Section 4 discusses the experimental setup, covering evaluation metrics, comparison methods, experimental results, and their analysis. The final section concludes the paper, summarizing our contributions and highlighting the significance of applying LLMs to the text-to-SQL task, with a special emphasis on schema-less querying.

2. Related works

The task of text-to-SQL is to convert natural language utterances into SQL queries. This field has attracted researchers in the NLP and database communities for decades.4-9 The methodologies currently in use to handle this task can be broadly divided into three categories: rule-based methods, fine-tuning methods, and in-context learning (ICL) methods. Rule-based approaches, as highlighted in other studies,7,10 utilize predefined templates to generate SQL queries. These methods show proficiency in certain scenarios but are limited by the necessity for manual rule formulation, which restricts their versatility across diverse domains.

Addressing the limitations of rule-based methods, recent research has ventured into more flexible approaches. The utilization of bidirectional long short-term memory and convolutional networks in Seq2Seq models11 has enhanced adaptability and effectiveness, though integrating structural database information remains a persistent challenge. Graph neural networks have emerged as a solution to this, with approaches that treat the database schema as a graph, as seen in other works.12,13 Furthermore, the introduction of the MIMICSQL dataset, together with a model-based system, the Translate-Edit Model for Question-to-SQL (TREQS), marked a significant advancement in text-to-SQL, particularly in the medical domain. The TREQS model sets a robust baseline for subsequent evaluations. In our study, we use the MIMICSQL dataset and the TREQS model as benchmarks to evaluate and compare the effectiveness of our proposed method. In addition, fine-tuning pretrained language models like T5 has demonstrated improved
Volume 1 Issue 2 (2024) 97 doi: 10.36922/aih.2661

