Artificial Intelligence in Health: AI vs humans in clinical code conversion
to address missing values, which were substituted with null codes.

Using this program, the number of matches found for each comparison was categorized as follows: perfect match, Level 1 partial match, Level 2 partial match, and incorrect match (Table 1). A Chi-squared test of independence was conducted to determine whether there was a statistically significant difference in the number of good matches (perfect and Level 2 partial matches) and poor matches (Level 1 partial and incorrect matches) across the three methods.

Table 1. Components of the International Classification of Diseases codes used to identify matches

                        Level 1   Level 2   Level 3
Example ICD code        F         30        .9
Perfect match           Yes       Yes       Yes
Level 2 partial match   Yes       Yes       No
Level 1 partial match   Yes       No        No
Incorrect match         No        No        No

Abbreviation: ICD: International Classification of Diseases.

2.5. Time and cost analysis

The time required to perform conversions in each phase was recorded to allow for a comparison of the time and labor costs associated with each method. The cost of completing the task was calculated by multiplying the time taken for each method by the hourly wage of a research assistant, which was set at AUD$52.20/hour (based on the pay rate for a university-employed research assistant, excluding on-costs). Setup costs – namely, the cost of subscribing to ChatGPT-4o or Claude 3.5 Sonnet – were also included in the total cost calculation.

3. Results

Table 2 displays the number of each type of match found for each of the comparisons. A Chi-squared test of independence was conducted to examine differences in the number of good and poor matches among manual coding, ChatGPT-4o, and Claude 3.5 Sonnet. The analysis reveals a statistically significant difference in agreement across the three comparisons (χ² [df = 2] = 56.722, p<0.001).

Agreement on good matches varies considerably between method pairs. The ChatGPT-4o and Claude 3.5 Sonnet pair shows the highest agreement, producing good matches for 1,520 cases (77.2%), compared with 1,329 cases (67.5%) for manual coding versus ChatGPT-4o and 1,357 cases (68.9%) for manual coding versus Claude 3.5 Sonnet.

Table 2. Number of correct matches across comparisons

Match category          Manual coding vs.   Manual coding vs.       ChatGPT-4o vs.
                        ChatGPT-4o (%)      Claude 3.5 Sonnet (%)   Claude 3.5 Sonnet (%)
Perfect match           578 (29.34)         599 (30.41)             757 (38.43)
Level 2 partial match   751 (38.12)         758 (38.48)             763 (38.73)
Level 1 partial match   235 (11.93)         212 (10.76)             230 (11.68)
Incorrect match         406 (20.61)         401 (20.36)             220 (11.17)

Table 3 displays the time and associated cost for a research assistant to perform data conversions using each tool for the 10% subset (n = 1,976) included in this study. It also includes an extrapolated estimate of costs if the entire dataset (n = 19,764) were to be converted from SNOMED to ICD.

Of the three methods used, manual coding was the most time-consuming and costly, taking 24 h and 31 min (AUD$1,279.77) to convert the subset utilized in this study. When extrapolated to the full dataset, this method is estimated to require 245 h and 12 min, with a labor cost of AUD$12,799.44.
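The match hierarchy in Table 1 can be sketched as a small classifier. This is a minimal illustration, not the study's actual program: it assumes an ICD-10-style code of the form chapter letter (Level 1), two-digit category (Level 2), and an optional subcategory after the dot (Level 3), as in F30.9; the parsing rule and function names are illustrative.

```python
import re

def split_icd(code):
    """Split an ICD-10-style code into its three levels, e.g.,
    'F30.9' -> ('F', '30', '9'). Illustrative parsing rule only."""
    m = re.fullmatch(r"([A-Z])(\d{2})(?:\.(\w+))?", code.strip().upper())
    if m is None:
        raise ValueError(f"unrecognized ICD code: {code!r}")
    return m.group(1), m.group(2), m.group(3) or ""

def match_category(code_a, code_b):
    """Classify agreement between two ICD codes using the Table 1 hierarchy."""
    a, b = split_icd(code_a), split_icd(code_b)
    if a[0] != b[0]:
        return "Incorrect match"        # Level 1 (chapter letter) differs
    if a[1] != b[1]:
        return "Level 1 partial match"  # only the chapter letter agrees
    if a[2] != b[2]:
        return "Level 2 partial match"  # Levels 1 and 2 agree
    return "Perfect match"              # all three levels agree
```

For instance, `match_category("F30.9", "F30.2")` yields a Level 2 partial match, while `match_category("F30.9", "G30.9")` is an incorrect match because the chapter letters disagree.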
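The good-versus-poor Chi-squared test can be illustrated with a plain-Python sketch. The 2×3 table below aggregates Table 2 (good = perfect + Level 2 partial; poor = Level 1 partial + incorrect). Note that this aggregation gives χ²(2) ≈ 52.6 rather than the reported 56.722, so the authors' exact cell construction evidently differs somewhat (perhaps in how unmatched records were handled); the code shows only the mechanics of the test.

```python
# Good and poor match counts aggregated from Table 2, one column per comparison.
observed = [
    [1329, 1357, 1520],  # good matches (perfect + Level 2 partial)
    [641,  613,  450],   # poor matches (Level 1 partial + incorrect)
]

def chi_square(table):
    """Pearson's Chi-squared test of independence for an r x c table;
    returns the test statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

stat, dof = chi_square(observed)
print(f"chi2({dof}) = {stat:.1f}")  # approx. 52.6 for this aggregation
```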
Table 3. Time and cost for each method

Method and scenario    Time              FTEs in weeks(a)  Labor cost (AUD)  Cost of GenAI tool (AUD)  Total cost (AUD)
10% subset (n = 1,976)
  Manual coding        24 h and 31 min   0.64              $1,279.77         N/A                       $1,279.77
  ChatGPT-4o           5 h and 45 min    0.15              $300.15           $30.00                    $330.15
  Claude 3.5 Sonnet    3 h and 10 min    0.08              $165.30           $30.00                    $195.30
Extrapolation for full dataset (n = 19,764)
  Manual coding        245 h and 12 min  6.45              $12,799.44        N/A                       $12,799.44
  ChatGPT-4o           57 h and 30 min   1.51              $3,001.50         $30.00                    $3,031.50
  Claude 3.5 Sonnet    31 h and 40 min   0.83              $1,653.00         $30.00                    $1,683.00

Note: (a) Assumes a 38-h work week.
Abbreviations: FTE: Full time equivalent; N/A: Not available.
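The labor figures in Table 3 follow directly from the recorded times and the AUD$52.20/hour rate given in Section 2.5. A minimal sketch of that arithmetic (function names are mine, not the paper's; the AUD$30.00 subscription figure is taken from Table 3's GenAI tool column):

```python
HOURLY_RATE_AUD = 52.20  # research assistant pay rate, excluding on-costs
WEEK_HOURS = 38          # full-time week assumed in Table 3's FTE column

def labor_cost(hours, minutes):
    """Labor cost in AUD for a recorded conversion time."""
    return round((hours + minutes / 60) * HOURLY_RATE_AUD, 2)

def fte_weeks(hours, minutes):
    """Full-time-equivalent weeks for the same recorded time."""
    return (hours + minutes / 60) / WEEK_HOURS

# Manual coding of the 10% subset took 24 h 31 min.
print(labor_cost(24, 31))          # 1279.77, matching Table 3
# Claude 3.5 Sonnet, full-dataset extrapolation: 31 h 40 min plus subscription.
print(labor_cost(31, 40) + 30.00)  # 1683.0 total
```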
Volume 2, Issue 4 (2025), 97. doi: 10.36922/AIH025200045

