Artificial Intelligence in Health AI vs humans in clinical code conversion

required to process this data – in addition to further messages to adapt and rectify the prompt if it was not processing correctly – the message limit was quickly reached and required waiting until the window had lapsed before proceeding with the rest of the task. This drastically inflated the timeframe in which the task could be completed.

Claude 3.5 Sonnet provided a more streamlined tool that did not require as much skill or time to produce a prompt. One key limitation of Claude 3.5 Sonnet was the process of importing and exporting data. Unlike ChatGPT-4o, at the time of the study, Claude 3.5 Sonnet did not have the functionality to directly import or export Microsoft Excel files; however, this functionality has since been added with the release of Claude 4.0 Sonnet. Therefore, it was necessary to copy and paste lines of data from the Microsoft Excel file into Claude 3.5 Sonnet. This led to a further limitation, which was the restrictions on both message length and the number of messages permitted. As the amount of data exceeded the input limit, it was necessary to break up the prompt into smaller, more manageable batches of codes (i.e., 500 lines at a time).

Although Claude 3.5 Sonnet did not appear to “hallucinate” with a greater number of conversions, only 50 codes could be converted at a time due to limits on the maximum output message length. This however meant that the message limit (approximately 45 messages every 5 h, dependent on message length) was quickly consumed. Given that Claude 3.5 Sonnet processed codes significantly faster than ChatGPT-4o, this led to a longer waiting period between exceeding the message limit and its renewal.

As Claude 3.5 Sonnet was unable to directly export a Microsoft Excel file at the end of the task, this significantly increased the time burden, as it was necessary to produce R Studio code to be run in order to produce the final output dataset. In addition to requiring the worker to have some knowledge of how to run the code in R Studio, this step accounted for the majority of the time taken to complete the task. For instance, it took 1 h and 15 min to complete the code conversion, with the remainder of the time (1 h and 55 min) spent writing and executing the R Studio code. The ability to produce downloadable Microsoft Excel files within Claude 3.5 Sonnet would rectify this limitation, significantly reducing the time and cost required to complete data analysis.

Although other methods are available for large-scale data extraction tasks, such as the creation of Application Programming Interfaces, these may require technical skill and knowledge to set up. These may also be cumbersome and impractical for ad hoc tasks performed by individuals lacking programming skills, particularly those in a busy clinical or hospital environment.[36] GenAI tools remain an accessible and easy-to-use alternative that requires minimal training to achieve a cost- and time-efficient outcome. Additionally, these tools are rapidly improving over time, potentially simplifying the task even further.

4.1. Study limitations

Although this case study provides valuable insights into the use of GenAI to complete a large-scale health data analysis task, several limitations still remain. Firstly, given that this is an Australian dataset, the SNOMED-CT codes came from the Australian edition (SNOMED-CT-AU) whilst the I-MAGIC tool only caters to the standard version. Therefore, this may account for why some codes were unable to be manually converted using the I-MAGIC tool. Additionally, multiple raters were required to complete the manual coding task, thereby introducing potential issues around inter-rater reliability, particularly when coders were less familiar with the task. Furthermore, the I-MAGIC tool currently uses ICD-10-CM and has not yet been updated for the new edition of the ICD (i.e., 11th edition). There is currently no mapping tool available that enables SNOMED CT to be converted to the newer version of the ICD.

In addition, this study only considered ICD-10-CM codes to be “correct” if they either perfectly or partially matched the manual code. Given that the aim of this study was to examine whether this task could be completed using GenAI, it was outside of the scope of the study to manually examine each “incorrect” match to determine whether it was clinically valid. However, this is likely to have significantly impacted the results and led to an underestimation of the level of agreement between the GenAI tools and manual ratings.

A further limitation of this study is the rapid pace at which GenAI tools are being developed and improved. It is likely that in the time since this study was conducted, newer tools have been released that may yield different results in terms of accuracy and processing speed. However, these advancements will likely only improve the overall efficiency and accuracy of GenAI tools.

4.2. Recommendations for future research

There is significant scope for future research within this field. Firstly, further analysis of the produced data from this study is planned to examine the clinical validity of partial or incorrect matches, which will further strengthen the results of this study by producing more accurate ratings between the GenAI and manual coding output. This study used the paid versions of both ChatGPT-4o and Claude 3.5 Sonnet, which offer additional functionalities

Volume 2 Issue 4 (2025) 99 doi: 10.36922/AIH025200045
