Of the two GenAI methods, Claude 3.5 Sonnet was the most time- and cost-efficient, requiring 3 h and 10 min (AUD$195.30, including subscription cost). ChatGPT-4o nearly doubled the time and cost of Claude 3.5 Sonnet, taking 5 h and 45 min (AUD$195.30, including subscription cost). Regardless, ChatGPT-4o still demonstrated significant time and cost savings compared to manual conversion.

4. Discussion

This evaluation provides a case study to investigate the ability of GenAI tools to process and analyze large-scale healthcare datasets. To the authors' knowledge, this study is the first to challenge GenAI tools to complete a clinical diagnostic coding conversion task and to compare the results against those of a manual rater. Converting clinical diagnostic codes between coding systems, as in the task presented in this study, is complex and time-consuming, and is commonly undertaken within healthcare data processing. This study therefore highlights an example of a potential use for GenAI within health data analytics.

The analysis in this study examined matches found between the two GenAI tools and the manual rater. The results indicated that the two GenAI tools agreed with each other more often than either agreed with the manual coding, suggesting that the GenAI methods may employ similar coding strategies or have overlapping strengths in code conversion that differ from manual coding approaches.

However, several caveats should be considered when interpreting these findings. For instance, the clinical validity of the ICD codes, particularly those identified as "partial" or "incorrect" matches, was not assessed. This may have resulted in several potentially valid codes being classified as incorrect. For example, the SNOMED code "314041007 Abdominal pain in early pregnancy" was manually converted to "R10.9 Unspecified abdominal pain". As this formed the benchmark for comparison between the GenAI tools, the conversions made by ChatGPT-4o ("O26.83 Pregnancy related abdominal pain") and Claude 3.5 Sonnet ("O26.892 Other specified pregnancy related conditions, first trimester") were considered incorrect matches.

During the analysis, the GenAI tools identified additional, and arguably better, matches between SNOMED CT and ICD-10-CM. Additionally, there were several cases where the I-MAGIC tool was unable to generate a match for a SNOMED CT code (e.g., "102508009 Well female child"), whereas ChatGPT-4o and Claude 3.5 Sonnet were both able to produce the same alternative ICD-10-CM code (i.e., "Z00.129 Encounter for routine child health examination without abnormal findings"). This suggests that further formal analysis may demonstrate that GenAI tools outperform human raters; it is therefore likely that the results of this study underestimate the accuracy and clinical validity of the matches produced by the GenAI tools.

Despite the GenAI tools demonstrating significant time and cost savings, several challenges were noted throughout the conversion process. With regard to ChatGPT-4o, the process of performing the SNOMED CT-AU to ICD-10-CM conversion was not fully automated, nor was it straightforward for someone inexperienced in writing GenAI prompts to perform. When piloting the prompt, ChatGPT-4o tended to skip lines or chunks of data, or to "hallucinate" (i.e., produce new input data that was not provided in the dataset). It was therefore necessary to explicitly instruct ChatGPT-4o to "manually and sequentially" convert the provided codes and to "…not hallucinate, and only convert codes which have been provided…" and "…not create new codes to convert." When completing the final batch of conversions, the output had to be monitored for accuracy: although ChatGPT-4o did not hallucinate during the task itself, it still produced new input data once it had run out of the codes it had been provided.

When providing additional prompts after the algorithm had performed well, it was beneficial to offer positive reinforcement to inform ChatGPT-4o that it had performed the task correctly. This prevented ChatGPT-4o from changing its original output. There were also instances where ChatGPT-4o would attempt to terminate the task (e.g., "Unfortunately I have run out of time to process additional conversions") but could be prompted to continue without further issue. Navigating these nuances required some level of skill and familiarity with ChatGPT-4o and GenAI prompts.

In terms of the time and labor required, ChatGPT-4o was not simply a "set and forget" solution to a large data task. Due to limits on the volume of codes it could process before it would sometimes hallucinate, a manual "nudge" (i.e., "Please continue with the next batch") was required after every 25 codes had been converted. This required continual monitoring of ChatGPT-4o while it was processing to ensure that lines of data were not skipped. Importantly, this renders the task impractical to complete in the background while undertaking other work.

ChatGPT-4o also imposes limits on the number of messages permitted within a certain timeframe (40 messages every 3 h). Given the number of nudges
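Taken together, the 25-code batch size and the 40-messages-per-3-h cap place a hard ceiling on throughput via the chat interface. The following back-of-the-envelope sketch assumes each message is a nudge that returns exactly one fully converted batch and ignores setup, correction, and reinforcement messages; these are assumptions for illustration, not measured figures from the study.

```python
# Back-of-the-envelope ceiling on chat-interface conversion throughput.
# Assumption (not measured): every message returns one fully converted batch,
# with no setup, correction, or positive-reinforcement turns.
messages_per_window = 40   # ChatGPT-4o message cap per rate-limit window
window_hours = 3           # length of the rate-limit window (h)
codes_per_batch = 25       # batch size used before a manual nudge was required

max_codes_per_window = messages_per_window * codes_per_batch  # 1,000 codes
max_codes_per_hour = max_codes_per_window / window_hours      # ~333 codes/h
print(max_codes_per_window, round(max_codes_per_hour))
```

In practice the achievable rate would be lower, because some of the 40 messages are spent on instructions, corrections, and positive reinforcement rather than on conversions.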

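The nudge-per-batch workflow described above was driven manually through ChatGPT-4o's message interface. For readers wishing to automate a comparable workflow, the sketch below shows how the same batching and anti-hallucination constraints could be expressed against the OpenAI Python client. The file name snomed_codes.csv, the column name snomed_term, and the batch-handling logic are illustrative assumptions; this is not the procedure used in the study.

```python
# Minimal sketch: batched SNOMED CT -> ICD-10-CM conversion prompts.
# Illustrative only; the study interacted with ChatGPT-4o through its message
# interface, not through API calls like these.
import csv
from openai import OpenAI  # official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment
BATCH_SIZE = 25    # mirrors the 25-code batches used before each manual nudge

INSTRUCTIONS = (
    "Manually and sequentially convert each SNOMED CT code below to the single "
    "best-matching ICD-10-CM code. Do not hallucinate: only convert codes which "
    "have been provided, do not create new codes to convert, and do not skip lines."
)

with open("snomed_codes.csv", newline="") as f:                 # hypothetical input file
    codes = [row["snomed_term"] for row in csv.DictReader(f)]   # hypothetical column

for i in range(0, len(codes), BATCH_SIZE):
    batch = codes[i:i + BATCH_SIZE]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": "\n".join(batch)},
        ],
    )
    print(response.choices[0].message.content)  # output still requires manual review
```

Even with such scripting, the outputs would still need line-by-line review, since the skipped-line and end-of-data behaviours described above are properties of the model rather than of the interface.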

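The agreement comparison discussed earlier in this section (the two GenAI tools agreeing with each other more often than either agreed with the manual coding) can be tabulated as simple pairwise agreement. The sketch below assumes a hypothetical table with one row per SNOMED CT code and columns manual, gpt4o, and claude holding the assigned ICD-10-CM codes; it is not the analysis pipeline used in the study.

```python
# Minimal sketch: pairwise agreement between the three conversion sources.
# File and column names are hypothetical; this is not the study's analysis code.
from itertools import combinations
import pandas as pd

df = pd.read_csv("conversions.csv")  # columns: snomed_code, manual, gpt4o, claude

for a, b in combinations(["manual", "gpt4o", "claude"], 2):
    agreement = (df[a].str.strip() == df[b].str.strip()).mean()
    print(f"{a} vs {b}: {agreement:.1%} exact-code agreement")
```

Exact string agreement is only a crude proxy for the exact, partial, and incorrect match categories referred to in this study, but it illustrates the kind of pairwise comparison underlying the observation above.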