performance in the text-to-SQL domain.14-16 However, these fine-tuning methods typically require an extensive amount of labeled training data tailored to the specific task, and they are often susceptible to over-fitting. This limitation raises concerns about their versatility and efficiency in practical applications.
The advent of LLMs like GPT has opened new avenues in text-to-SQL tasks, particularly due to their ICL capabilities. These models often outperform fine-tuning methods in various NLP downstream tasks, especially in scenarios requiring few-shot or zero-shot learning. Nevertheless, the effectiveness of LLMs heavily relies on the design of input prompts, a factor that significantly influences the output quality.17-19 The ICL performance of LLMs in text-to-SQL tasks, especially the impact of different prompts, has also been examined.

While basic prompting serves as a benchmark for assessing the fundamental capabilities of LLMs, more sophisticated prompt designs have been shown to significantly enhance performance. Notably, a few-shot learning approach employing GPT-4 recently set a new benchmark in text-to-SQL tasks, achieving state-of-the-art results. However, this method necessitates manual input for demonstrations and tends to use a large number of tokens, requiring more time and resources. A sketch of such a few-shot prompt is shown below.
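To make the prompting discussion concrete, the following is a minimal sketch of how a few-shot text-to-SQL prompt can be assembled. The demonstration pairs, table name, and column names are hypothetical placeholders introduced for illustration; they are not drawn from MIMICSQL or from the GPT-4 approach cited above.

```python
# Illustrative few-shot prompt construction; the demonstrations and the
# "demographic" table/columns below are hypothetical, not MIMICSQL records.
demonstrations = [
    ("How many patients are older than 60?",
     "SELECT COUNT(*) FROM demographic WHERE age > 60"),
    ("List the names of female patients.",
     "SELECT name FROM demographic WHERE gender = 'F'"),
]

question = "How many patients were admitted in 2012?"

# Each demonstration is a (question, SQL) pair; the target question is
# appended last so the model completes the final "SQL:" field.
prompt = "Translate the question into an SQL query.\n\n"
for q, sql in demonstrations:
    prompt += f"Question: {q}\nSQL: {sql}\n\n"
prompt += f"Question: {question}\nSQL:"

print(prompt)
```

Each added demonstration improves grounding but also consumes context tokens, which is the time-and-resource cost noted above.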
This study extends the current advancements in LLMs within the text-to-SQL domain. Specifically, we fine-tune Flan-T5-based models on the MIMICSQL dataset. Each of these models is a sequence-to-sequence LLM that can also be used commercially. The model was published by Google researchers in late 2022 and has been fine-tuned on multiple tasks. It reframes various tasks into a text-to-text format, such as translation, linguistic acceptability, sentence similarity, and document summarization. Similarly, the architecture of the Flan-T5 model closely aligns with the encoder-decoder structure utilized in the original Transformer paper. The primary distinction lies in the size and nature of the training data; Flan-T5 was trained on an extensive 750 GB corpus of text known as the Colossal Clean Crawled Corpus (C4), and it comes in five variations: flan-t5-small (80M parameters, requiring 300 MB in memory), flan-t5-base (250M parameters, requiring 990 MB in memory), flan-t5-large (780M parameters, requiring 1 GB in memory), flan-t5-xl (3B parameters, requiring 12 GB in memory), and flan-t5-xxl (11B parameters, requiring 80 GB in memory). These models can be used for various NLP tasks out-of-the-box (with zero or few shots); however, to leverage their full potential and ensure optimal performance for specific applications, fine-tuning is a crucial step.
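As a minimal sketch of such out-of-the-box use, the snippet below loads flan-t5-base through the Hugging Face transformers library (an assumption about tooling; the text does not prescribe a specific framework) and runs zero-shot inference on a text-to-SQL instruction. The prompt's table and column names are hypothetical.

```python
# Zero-shot sketch using Hugging Face transformers (assumed tooling);
# runs on CPU, which is feasible for the smaller Flan-T5 variants.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # the 250M-parameter variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical table/column names, used only to illustrate the input format.
prompt = ("Translate to SQL: how many patients in the demographic "
          "table are older than 60?")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```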
Below are the main points and reasons highlighting the choice of fine-tuning FLAN-T5 for the specific text-to-SQL task (a fine-tuning sketch follows the list):
(i) Fine-tuning FLAN-T5 is important to adapt the model to specific tasks and improve its performance on those tasks.
(ii) Fine-tuning allows for customization of the model to better suit the user's needs and data.
(iii) The ability to fine-tune FLAN-T5 on local workstations with CPUs makes it accessible to a wider range of users.
(iv) This accessibility is beneficial for smaller organizations or individual researchers who may not have access to GPU resources.
(v) Overall, fine-tuning FLAN-T5 is a valuable step in optimizing the model for specific use cases and maximizing its potential benefits.
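The following is a minimal sketch of such a fine-tuning run on CPU, again assuming the transformers and torch libraries; the two training pairs are hypothetical placeholders, and a real run would iterate over the MIMICSQL training split with properly chosen hyperparameters.

```python
# Minimal CPU fine-tuning sketch (assumed tooling: transformers + torch).
# The two training pairs are hypothetical stand-ins for MIMICSQL examples.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

pairs = [
    ("Translate to SQL: how many patients are older than 60?",
     "SELECT COUNT(*) FROM demographic WHERE age > 60"),
    ("Translate to SQL: list the insurance of patient 2560.",
     "SELECT insurance FROM demographic WHERE subject_id = 2560"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for epoch in range(3):  # a real run uses many more examples and epochs
    for question, sql in pairs:
        inputs = tokenizer(question, return_tensors="pt")
        labels = tokenizer(sql, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```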
Our emphasis on exploring schema-less approaches led us to investigate the viability and advantages of implementing text-to-SQL systems that depend less on explicit knowledge of the database schema.

3. Data and methods

This section delineates the comprehensive methodology of our study, encompassing a detailed description of the dataset utilized, the architecture of the model employed, and the specifics of both the training and evaluation processes.

3.1. Dataset

The MIMICSQL dataset is a significant resource for question-to-SQL generation in the healthcare domain, comprising 10,000 question-SQL pairs. This large-scale dataset is based on the Medical Information Mart for Intensive Care III (MIMIC III) dataset, a widely used electronic medical records (EMR) database. It is divided into two subsets: one containing template questions (machine-generated) and the other featuring natural language questions (human-annotated).
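For illustration, a MIMICSQL-style record pairs a natural language question with its target query over the MIMIC III-derived tables; the specific question and the table and column names below are hypothetical stand-ins rather than verbatim dataset entries.

```python
# Hypothetical MIMICSQL-style record (illustrative, not a verbatim entry).
example = {
    "question": "how many female patients had a ct scan procedure?",
    "sql": (
        "SELECT COUNT(DISTINCT demographic.subject_id) "
        "FROM demographic "
        "INNER JOIN procedures ON demographic.hadm_id = procedures.hadm_id "
        "WHERE demographic.gender = 'F' "
        "AND procedures.short_title = 'CT scan'"
    ),
}
```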
3.1.1. Diversity and complexity of the dataset

The MIMICSQL dataset covers a wide range of patient information categories, including demographics, laboratory tests, diagnoses, procedures, and prescriptions. These categories are embedded as a schema structure that outlines the database's tables, columns, and interrelationships, serving as a crucial guide for the models to comprehend the database structure and accurately formulate SQL queries. Table 1 illustrates the DEMOGRAPHIC table, while Table 2 presents the PROCEDURES table from the MIMICSQL dataset. This diversity reflects the complexity and multidimensionality of healthcare-related queries,

