
Artificial Intelligence in Health: AI vs humans in clinical code conversion



required to process this data (in addition to further messages to adapt and rectify the prompt if it was not processing correctly), the message limit was quickly reached and required waiting until the window had lapsed before proceeding with the rest of the task. This drastically inflated the timeframe in which the task could be completed.

Claude 3.5 Sonnet provided a more streamlined tool that did not require as much skill or time to produce a prompt. One key limitation of Claude 3.5 Sonnet was the process of importing and exporting data. Unlike ChatGPT-4o, at the time of the study, Claude 3.5 Sonnet did not have the functionality to directly import or export Microsoft Excel files; however, this functionality has since been added with the release of Claude 4.0 Sonnet. It was therefore necessary to copy and paste lines of data from the Microsoft Excel file into Claude 3.5 Sonnet. This led to a further limitation: the restrictions on both message length and the number of messages permitted. As the amount of data exceeded the input limit, it was necessary to break up the prompt into smaller, more manageable batches of codes (i.e., 500 lines at a time).

Although Claude 3.5 Sonnet did not appear to “hallucinate” with a greater number of conversions, only 50 codes could be converted at a time due to limits on the maximum output message length. This, however, meant that the message limit (approximately 45 messages every 5 h, dependent on message length) was quickly consumed. Given that Claude 3.5 Sonnet processed codes significantly faster than ChatGPT-4o, this led to a longer waiting period between exceeding the message limit and its renewal. As Claude 3.5 Sonnet was unable to directly export a Microsoft Excel file at the end of the task, the time burden increased significantly, because R Studio code had to be written and run to produce the final output dataset. In addition to requiring the worker to have some knowledge of how to run the code in R Studio, this step accounted for the majority of the time taken to complete the task. For instance, it took 1 h and 15 min to complete the code conversion, with the remainder of the time (1 h and 55 min) spent writing and executing the R Studio code. The ability to produce downloadable Microsoft Excel files within Claude 3.5 Sonnet would rectify this limitation, significantly reducing the time and cost required to complete data analysis.

Although other methods are available for large-scale data extraction tasks, such as the creation of Application Programming Interfaces, these may require technical skill and knowledge to set up. These may also be cumbersome and impractical for ad hoc tasks performed by individuals lacking programming skills, particularly those in a busy clinical or hospital environment.³⁶ GenAI tools remain an accessible and easy-to-use alternative that requires minimal training to achieve a cost- and time-efficient outcome. Additionally, these tools are rapidly improving over time, potentially simplifying the task even further.

4.1. Study limitations

Although this case study provides valuable insights into the use of GenAI to complete a large-scale health data analysis task, several limitations remain. Firstly, given that this is an Australian dataset, the SNOMED-CT codes came from the Australian edition (SNOMED-CT-AU), whilst the I-MAGIC tool only caters to the standard version. This may account for why some codes were unable to be manually converted using the I-MAGIC tool. Additionally, multiple raters were required to complete the manual coding task, thereby introducing potential issues around inter-rater reliability, particularly when coders were less familiar with the task. Furthermore, the I-MAGIC tool currently uses ICD-10-CM and has not yet been updated for the new edition of the ICD (i.e., 11th edition). There is currently no mapping tool available that enables SNOMED CT to be converted to the newer version of the ICD.

In addition, this study only considered ICD-10-CM codes to be “correct” if they either perfectly or partially matched the manual code. Given that the aim of this study was to examine whether this task could be completed using GenAI, it was outside of the scope of the study to manually examine each “incorrect” match to determine whether it was clinically valid. However, this is likely to have significantly impacted the results and led to an underestimation of the level of agreement between the GenAI tools and manual ratings.

A further limitation of this study is the rapid pace at which GenAI tools are being developed and improved. It is likely that in the time since this study was conducted, newer tools have been released that may yield different results in terms of accuracy and processing speed. However, these advancements will likely only improve the overall efficiency and accuracy of GenAI tools.

4.2. Recommendations for future research

There is significant scope for future research within this field. Firstly, further analysis of the data produced in this study is planned to examine the clinical validity of partial or incorrect matches, which will further strengthen the results of this study by producing more accurate ratings between the GenAI and manual coding output. This study used the paid versions of both ChatGPT-4o and Claude 3.5 Sonnet, which offer additional functionalities



            Volume 2 Issue 4 (2025)                         99                          doi: 10.36922/AIH025200045