
Artificial Intelligence in Health





                                        ORIGINAL RESEARCH ARTICLE
                                        Efficient schema-less text-to-SQL conversion using large language models



                                        Youssef Mellah*, Veysel Kocaman, Hasham Ul Haq, and David Talby
                                        John Snow Labs, Coastal Highway, Lewes, Delaware, United States of America



                                        Abstract

                                        Large language models (LLMs) are increasingly being applied to a range of tasks,
                                        including text-to-SQL (the process of converting natural language to SQL queries).
                                        While most studies revolve around training LLMs on large SQL corpora for better
                                        generalization and then performing prompt engineering during inference, we
                                        investigate the notion of training LLMs for schema-less prompting. In particular, our
                                        approach uses simple natural language questions as input without any additional
                                        knowledge about the database schema. By doing so, we demonstrate that
                                        smaller models paired with simpler prompts result in considerable performance
                                        improvement while generating SQL queries. Our model, based on the Flan-T5
                                        architecture, achieves logical form accuracy (LFA) of 0.85 on the MIMICSQL dataset,
                                        significantly outperforming current state-of-the-art models such as Defog-SQL-Coder,
                                        GPT-3.5-Turbo, LLaMA-2-7B and GPT-4. This approach reduces the model
                                        size, lessening the amount of data and infrastructure cost required for training and
                                        serving, and improves the performance to enable the generation of much more
                                        complex SQL queries.

                                        Keywords: Large language models; MIMICSQL; Schema-less; Logical form accuracy;
                                        Defog-SQL-Coder; GPT-3.5-Turbo; LLaMA-2-7B; GPT-4

            *Corresponding author: Youssef Mellah (youssef@johnsnowlabs.com)

            Citation: Mellah Y, Kocaman V, Haq HU, Talby D. Efficient schema-less text-to-SQL conversion using large language models. Artif Intell Health. 2024;1(2): 96-106. doi: 10.36922/aih.2661

            Received: January 6, 2024
            Accepted: February 23, 2024
            Published Online: April 4, 2024

            Copyright: © 2024 Author(s). This is an Open-Access article distributed under the terms of the Creative Commons Attribution License, permitting distribution, and reproduction in any medium, provided the original work is properly cited.

            Publisher’s Note: AccScience Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

                                        1. Introduction

                                        Text-to-SQL technology has gained considerable attention in recent years, emerging as
                                        a transformative tool for database interaction. Its key advantage lies in enabling users,
                                        particularly those with limited SQL knowledge, to use a fine-tuned large language model
                                        (LLM) to interact with databases using natural language. This innovation significantly
                                        reduces the necessity to learn SQL for data retrieval and analytics from tabular datasets.
                                        The effectiveness of such systems hinges on two main aspects: the intuitiveness of usage
                                        and the accuracy of the generated queries. Essentially, this means that user prompts
                                        should be straightforward and the corresponding SQL queries must accurately address
                                        the user’s query with high precision.

                                        The growing abundance of structured and semi-structured data in various domains,
                                        ranging from e-commerce to healthcare, highlights the importance of the text-to-SQL
                                        task. This task gains relevance as the demand for more intuitive interfaces to query and
                                        extract information from these databases increases. Traditional SQL queries, which
                                        require understanding of both database schema and query syntax, are often challenging
                                        for users lacking technical expertise. Text-to-SQL aims to mitigate this challenge by

