Artificial Intelligence in Health
ORIGINAL RESEARCH ARTICLE
Efficient schema-less text-to-SQL conversion
using large language models
Youssef Mellah*, Veysel Kocaman, Hasham Ul Haq, and David Talby
John Snow Labs, Coastal Highway, Lewes, Delaware, United States of America
Abstract
Large language models (LLMs) are increasingly being applied to a range of tasks,
including text-to-SQL (the process of converting natural language into SQL queries).
While most studies revolve around training LLMs on large SQL corpora for better
generalization and then performing prompt engineering during inference, we
investigate the notion of training LLMs for schema-less prompting. In particular, our
approach uses simple natural language questions as input without any additional
knowledge about the database schema. By doing so, we demonstrate that
smaller models paired with simpler prompts yield considerable performance
improvements when generating SQL queries. Our model, based on the Flan-T5
architecture, achieves a logical form accuracy (LFA) of 0.85 on the MIMICSQL dataset,
significantly outperforming current state-of-the-art models such as Defog-SQL-Coder,
GPT-3.5-Turbo, LLaMA-2-7B, and GPT-4. This approach reduces the model
size, lessening the amount of data and the infrastructure cost required for training and
serving, while improving performance enough to enable the generation of much more
complex SQL queries.
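To make the schema-less setting concrete, the sketch below shows what inference with a fine-tuned sequence-to-sequence model of the Flan-T5 family could look like: the prompt contains only the natural language question, with no table or column information. The checkpoint name is a hypothetical placeholder rather than the authors' released model, and the snippet assumes the Hugging Face transformers library.

# Minimal sketch of schema-less text-to-SQL inference with a fine-tuned
# Flan-T5-style model (hypothetical checkpoint name, not the paper's artifact).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "flan-t5-base-text2sql"  # placeholder for a fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Schema-less prompt: only the natural-language question is supplied;
# no table names, column lists, or other database schema details.
question = "How many patients were diagnosed with sepsis in 2019?"

inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

sql_query = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sql_query)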
*Corresponding author:
Youssef Mellah (youssef@johnsnowlabs.com)

Keywords: Large language models; MIMICSQL; Schema-less; Logical form accuracy; Defog-SQL-Coder; GPT-3.5-Turbo; LLaMA-2-7B; GPT-4
Citation: Mellah Y, Kocaman V, Haq HU, Talby D. Efficient schema-less text-to-SQL conversion using large language models. Artif Intell Health. 2024;1(2):96-106. doi: 10.36922/aih.2661

Received: January 6, 2024
Accepted: February 23, 2024
Published Online: April 4, 2024

Copyright: © 2024 Author(s). This is an Open-Access article distributed under the terms of the Creative Commons Attribution License, permitting distribution and reproduction in any medium, provided the original work is properly cited.

Publisher's Note: AccScience Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1. Introduction

Text-to-SQL technology has gained considerable attention in recent years, emerging as a transformative tool for database interaction. Its key advantage lies in enabling users, particularly those with limited SQL knowledge, to use a fine-tuned large language model (LLM) to interact with databases using natural language. This innovation significantly reduces the need to learn SQL for data retrieval and analytics on tabular datasets. The effectiveness of such systems hinges on two main aspects: the intuitiveness of use and the accuracy of the generated queries. Essentially, this means that user prompts should be straightforward and the corresponding SQL queries must accurately address the user's request.

The growing abundance of structured and semi-structured data in various domains, ranging from e-commerce to healthcare, highlights the importance of the text-to-SQL task. This task gains relevance as the demand for more intuitive interfaces to query and extract information from these databases increases. Traditional SQL queries, which require an understanding of both the database schema and query syntax, are often challenging for users lacking technical expertise. Text-to-SQL aims to mitigate this challenge by

