Of the two GenAI methods, Claude 3.5 Sonnet was the most time- and cost-efficient, requiring 3 h and 10 min (AUD$195.30, including subscription cost). ChatGPT-4o nearly doubled the time and cost of Claude 3.5 Sonnet, taking 5 h and 45 min (AUD$195.30, including subscription cost). Regardless, ChatGPT-4o still demonstrated significant time and cost savings compared with manual conversion.

4. Discussion

This evaluation provides a case study investigating the ability of GenAI tools to process and analyze large-scale healthcare datasets. To the authors’ knowledge, this study is the first to challenge GenAI tools to complete a clinical diagnostic coding conversion task and to compare the results against those of a manual rater. Converting clinical diagnostic codes to other coding systems, as in the task presented in this study, is complex, time-consuming work commonly undertaken within healthcare data processing. This study therefore highlights an example of a potential use for GenAI within health data analytics.

The analysis in this study examined matches found between the two GenAI tools and the manual rater. The results indicated that the two GenAI tools showed a higher level of agreement with each other than either did with the manual coding, suggesting that the GenAI methods may employ similar coding strategies or share strengths in code conversion that differ from manual coding approaches.
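For readers wishing to reproduce this kind of comparison, the sketch below shows one way pairwise agreement could be computed, assuming the codes assigned to each SNOMED CT concept are stored in a CSV with the hypothetical columns manual, chatgpt_4o, and claude_35_sonnet. It is illustrative only and is not the analysis pipeline used in this study.

```python
# Illustrative sketch: the file name and column names are hypothetical,
# and this is not the analysis pipeline used in the study.
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score

# One row per SNOMED CT code; each column holds the ICD-10-CM code
# assigned by that rater.
ratings = pd.read_csv("match_ratings.csv")
raters = ["manual", "chatgpt_4o", "claude_35_sonnet"]

for rater_a, rater_b in combinations(raters, 2):
    exact = (ratings[rater_a] == ratings[rater_b]).mean()
    kappa = cohen_kappa_score(ratings[rater_a], ratings[rater_b])
    print(f"{rater_a} vs {rater_b}: {exact:.1%} exact agreement, kappa = {kappa:.2f}")
```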
However, when interpreting these findings, there are several caveats to consider. For instance, the clinical validity of the ICD codes, particularly in cases identified as “partial” or “incorrect” matches, was not assessed. This may have resulted in several potentially valid codes being scored as incorrect. For example, the SNOMED code “314041007 Abdominal pain in early pregnancy” was manually converted to “R10.9 Unspecified abdominal pain”. As this formed the benchmark for comparison between the GenAI tools, the conversions made by ChatGPT-4o (“O26.83 Pregnancy related abdominal pain”) and Claude 3.5 Sonnet (“O26.892 Other specified pregnancy related conditions, first trimester”) were considered incorrect matches.

During the analysis, the GenAI tools identified additional, and arguably better, matches between SNOMED CT and ICD-10-CM. There were also several cases where the I-MAGIC tool was unable to generate a match for a SNOMED CT code (e.g., “102508009 Well female child”), whereas ChatGPT-4o and Claude 3.5 Sonnet both produced the same alternative ICD-10-CM code (i.e., “Z00.129 Encounter for routine child health examination without abnormal findings”). This suggests that further formal analysis may demonstrate that GenAI tools outperform human raters, and it is therefore likely that the results of this study significantly underestimate the accuracy and clinical validity of the matches produced by the GenAI tools.

Despite the GenAI tools demonstrating significant time and cost savings, several challenges were noted throughout the conversion process. With regard to ChatGPT-4o, the process of performing the SNOMED CT-AU to ICD-10-CM conversion was not fully automated, nor was it straightforward for someone inexperienced in writing GenAI prompts to perform. When piloting the prompt, ChatGPT-4o tended to skip lines or chunks of data, or to “hallucinate” (i.e., produce new input data that was not provided in the dataset). It was therefore necessary to explicitly instruct ChatGPT-4o to “manually and sequentially” convert the provided codes, to “…not hallucinate, and only convert codes which have been provided…”, and to “…not create new codes to convert.” When completing the final batch of conversions, the output had to be monitored for accuracy: although ChatGPT-4o did not hallucinate during the task itself, it still produced new input data once it ran out of the codes it had been provided.
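The quoted constraints lend themselves to a reusable prompt template. The sketch below shows how such a prompt might be assembled; only the quoted fragments come from the study, while the surrounding wording, the requested output format, and the function name are assumptions for illustration rather than the actual prompt used.

```python
# Illustrative prompt template: only the quoted constraints come from the text
# above; the remaining wording and the output format are assumed.
def build_conversion_prompt(snomed_codes: list[str]) -> str:
    """Assemble a conversion prompt for a batch of SNOMED CT-AU codes."""
    code_list = "\n".join(snomed_codes)
    return (
        "Manually and sequentially convert each SNOMED CT-AU code below to its "
        "closest ICD-10-CM code. Do not hallucinate, and only convert codes which "
        "have been provided. Do not create new codes to convert.\n"
        "Return one line per input code in the form "
        "'<SNOMED code> -> <ICD-10-CM code> <ICD-10-CM description>'.\n\n"
        f"{code_list}"
    )

print(build_conversion_prompt(["314041007 Abdominal pain in early pregnancy"]))
```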
When providing additional prompts after the algorithm had performed well, it was beneficial to give positive reinforcement informing ChatGPT-4o that it had performed the task correctly; this prevented ChatGPT-4o from changing its original output. There were also instances where ChatGPT-4o would attempt to terminate the task (i.e., “Unfortunately I have run out of time to process additional conversions”) but could be prompted to continue without further issue. These nuances required some level of skill and familiarity with ChatGPT-4o and GenAI prompts.

In terms of the time and labor required, ChatGPT-4o was not simply a “set and forget” solution to a large data task. Due to limits on the volume of codes it could process before sometimes hallucinating, a manual “nudge” (i.e., “Please continue with the next batch”) was required after every 25 codes had been converted. This required continual monitoring of ChatGPT-4o while it was processing to ensure that lines of data were not skipped. Importantly, this renders the task impractical to complete in the background while undertaking other work.
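In effect, the conversion ran as a batched, human-supervised loop rather than a single hands-off prompt. A minimal sketch of that loop is shown below; the 25-code batch size and the nudge wording come from the text, whereas the function name, the assumption that all codes are supplied in the first message, and the 500-code example are illustrative (the study itself worked through the ChatGPT-4o chat interface rather than a script).

```python
# Sketch of the batched, manually supervised workflow described above.
# The batch size and nudge wording are from the text; everything else is assumed.
from math import ceil
from typing import Iterator

BATCH_SIZE = 25
NUDGE = "Please continue with the next batch"

def conversion_messages(initial_prompt: str, n_codes: int) -> Iterator[str]:
    """Yield the messages a human operator sends: the initial conversion prompt,
    then one nudge for each further batch of 25 codes still to be converted."""
    yield initial_prompt
    for _ in range(ceil(n_codes / BATCH_SIZE) - 1):
        yield NUDGE

# Hypothetical example: a 500-code list needs 1 prompt + 19 nudges = 20 messages,
# each sent only after checking that no lines were skipped in the previous batch.
print(sum(1 for _ in conversion_messages("<conversion prompt>", 500)))  # -> 20
```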
ChatGPT-4o also imposes limits on the number of messages permitted within a certain timeframe (40 messages every 3 h). Given the number of nudges

