
Artificial Intelligence in Health                                    AI vs humans in clinical code conversion



            to address missing values, which were substituted with null codes.
              Using this program, the number of matches found for each comparison was categorized as follows: perfect match, Level 1 partial match, Level 2 partial match, and incorrect match (Table 1). A Chi-squared test of independence was conducted to determine whether there was a statistically significant difference in the number of good matches (perfect and Level 2 partial matches) and poor matches (Level 1 partial and incorrect matches) across the three methods.

            Table 1. Components of the International Classification of Diseases codes used to identify matches

                                    Level 1   Level 2   Level 3
            Example ICD code        F         30        .9
            Perfect match           Yes       Yes       Yes
            Level 2 partial match   Yes       Yes       No
            Level 1 partial match   Yes       No        No
            Incorrect match         No        No        No
            Abbreviation: ICD: International Classification of Diseases.

            2.5. Time and cost analysis
            The time required to perform conversions in each phase was recorded to allow for a comparison of the time and labor costs associated with each method. The cost of completing the task was calculated by multiplying the time taken for each method by the hourly wage of a research assistant, which was set at AUD$52.20/hour (based on the pay rate for a university-employed research assistant, excluding on-costs). Setup costs – namely, the cost of subscribing to ChatGPT-4o or Claude 3.5 Sonnet – were also included in the total cost calculation.

            3. Results
            Table 2 displays the number of each type of match found for each of the comparisons. A Chi-squared test of independence was conducted to examine differences in the number of good and poor matches among manual coding, ChatGPT-4o, and Claude 3.5 Sonnet. The analysis reveals a statistically significant difference in agreement across the three comparisons (χ² [df = 2] = 56.722, p<0.001).
              Agreement on good matches varies considerably between method pairs. The ChatGPT-4o and Claude 3.5 Sonnet pair shows the highest agreement, producing good matches for 1,520 cases (77.2%), compared to 1,329 cases (67.5%) for manual coding versus ChatGPT-4o and 1,357 cases (68.9%) for manual coding versus Claude 3.5 Sonnet.

            Table 2. Number of correct matches across comparisons

            Match category          Manual coding vs.   Manual coding vs.       ChatGPT-4o vs.
                                    ChatGPT-4o (%)      Claude 3.5 Sonnet (%)   Claude 3.5 Sonnet (%)
            Perfect match           578 (29.34)         599 (30.41)             757 (38.43)
            Level 2 partial match   751 (38.12)         758 (38.48)             763 (38.73)
            Level 1 partial match   235 (11.93)         212 (10.76)             230 (11.68)
            Incorrect match         406 (20.61)         401 (20.36)             220 (11.17)

              Table 3 displays the time and associated cost for a research assistant to perform data conversions using each tool for the 10% subset (n = 1,976) included in this study. It also includes an extrapolated estimate of costs if the entire dataset (n = 19,764) were to be converted from SNOMED to ICD.
              Of the three methods used, manual coding was the most time-consuming and costly, taking 24 h and 31 min (AUD$1,279.77) to convert the subset utilized in this study. When extrapolated to the full dataset, this method is estimated to require 245 h and 12 min, with a labor cost of AUD$12,799.44.
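The categorization scheme in Table 1 and the good/poor χ² comparison can be sketched as follows. This is illustrative, not the authors' code: the parsing rule (one letter, two digits, optional decimal suffix) is inferred from Table 1's "F30.9" example, and recomputing the statistic from the published counts may not reproduce the reported 56.722 exactly, although the df = 2 critical value at p = 0.001 (13.816) is comfortably exceeded either way.

```python
import re

def icd_components(code):
    """Split an ICD-style code such as 'F30.9' into (Level 1, Level 2, Level 3).

    Parsing rule inferred from Table 1's example; real ICD codes can be
    more complex, so this is a sketch only.
    """
    m = re.match(r"^([A-Z])(\d{2})(?:\.(\w+))?$", code)
    return m.groups() if m else (None, None, None)

def match_category(code_a, code_b):
    """Classify a pair of codes using the scheme in Table 1."""
    a, b = icd_components(code_a), icd_components(code_b)
    if a[0] is None or a[0] != b[0]:
        return "incorrect match"
    if a[1] != b[1]:
        return "Level 1 partial match"
    if a[2] != b[2]:
        return "Level 2 partial match"
    return "perfect match"

# Good = perfect + Level 2 partial; poor = Level 1 partial + incorrect (Table 2).
observed = [
    (578 + 751, 235 + 406),   # manual coding vs. ChatGPT-4o
    (599 + 758, 212 + 401),   # manual coding vs. Claude 3.5 Sonnet
    (757 + 763, 230 + 220),   # ChatGPT-4o vs. Claude 3.5 Sonnet
]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(col_totals)
chi2 = sum(
    (obs - exp) ** 2 / exp
    for row in observed
    for obs, exp in zip(row, (sum(row) * c / grand_total for c in col_totals))
)
# df = (3 - 1) * (2 - 1) = 2; the p = 0.001 critical value is 13.816.
print(match_category("F30.9", "F30.2"))  # Level 2 partial match
print(round(chi2, 2))
```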


            Table 3. Time and cost for each method

            Method and scenario     Time               FTEs in weeks a   Labor cost (AUD)   Cost of GenAI tool (AUD)   Total cost (AUD)
            10% subset (n = 1,976)
              Manual coding         24 h and 31 min    0.64              $1,279.77          N/A                        $1,279.77
              ChatGPT-4o            5 h and 45 min     0.15              $300.15            $30.00                     $330.15
              Claude 3.5 Sonnet     3 h and 10 min     0.08              $165.30            $30.00                     $195.30
            Extrapolation for full dataset (n = 19,764)
              Manual coding         245 h and 12 min   6.45              $12,799.44         N/A                        $12,799.44
              ChatGPT-4o            57 h and 30 min    1.51              $3,001.50          $30.00                     $3,031.50
              Claude 3.5 Sonnet     31 h and 40 min    0.83              $1,653.00          $30.00                     $1,683.00
            Note: a Assumes a 38-h work week.
            Abbreviations: FTE: Full-time equivalent; N/A: Not applicable.
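The cost model described in Section 2.5 (labor time × AUD$52.20/h, plus a flat AUD$30.00 subscription for the GenAI tools) can be checked against Table 3 directly; the function name and structure below are illustrative, not taken from the paper.

```python
HOURLY_WAGE_AUD = 52.20   # research assistant rate, excluding on-costs
SUBSCRIPTION_AUD = 30.00  # ChatGPT-4o / Claude 3.5 Sonnet setup cost

def costs(hours, minutes, uses_genai):
    """Return (labor cost, total cost) in AUD for a conversion run."""
    elapsed_hours = hours + minutes / 60
    labor = round(elapsed_hours * HOURLY_WAGE_AUD, 2)
    setup = SUBSCRIPTION_AUD if uses_genai else 0.0
    return labor, round(labor + setup, 2)

# 10% subset figures, matching Table 3:
print(costs(24, 31, False))   # manual coding -> (1279.77, 1279.77)
print(costs(5, 45, True))     # ChatGPT-4o   -> (300.15, 330.15)
print(costs(3, 10, True))     # Claude 3.5   -> (165.3, 195.3)
# Full-dataset extrapolation for manual coding:
print(costs(245, 12, False))  # -> (12799.44, 12799.44)
```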

            Volume 2 Issue 4 (2025)                         97                          doi: 10.36922/AIH025200045