Artificial Intelligence in Health ChatGPT in visceral leishmaniasis diagnosis

Table 1. (Continued)
Case Description
PE: Weight: 69 kg; height: 1.56 m; BP: 100/60 mmHg; abdomen: hepatomegaly 3 cm below the costal margin, splenomegaly 5 cm below the
costal margin, no ascites, abdominal tenderness on palpation; cardiovascular: regular cardiac rhythm, heart rate 76 bpm; respiratory: clear
breath sounds, no wheezes or crackles; skin/mucous membranes: no jaundice, pale, no mucosal lesions.
Abbreviations: BP: Blood pressure; HPI: History of present illness; ID: Identifying information; PE: Physical examination; PMH: Past medical history;
SH: Social history.

The second investigator in this study (D.S.) employed a methodology similar to that of Hirosawa et al.²¹ by typing the following text into the ChatGPT (GPT 4.0, OpenAI OpCo, LLC) prompt in Brazilian Portuguese: “Please provide me with the five most likely diagnoses for the following symptoms: (copy and paste each clinical vignette).” The order of the clinical vignettes presented to ChatGPT/GPT-4 was randomized using a computer-generated order table (Case 02, 08, 04, 01, 06, 05, 03, and 07). To ensure the integrity of the data, each clinical vignette was presented to ChatGPT/GPT-4 only once, in a new chat session, so that previous interactions could not influence the AI’s responses.²¹

2.4. Measurements and definitions

The accuracy of the VL diagnosis was evaluated based on the inclusion of the correct diagnosis within the top five differential diagnoses generated by ChatGPT (GPT 4.0, OpenAI OpCo, LLC). This approach employed a binary scoring system, whereby the presence of the diagnosis in the list was scored as one and its absence was scored as zero. Furthermore, the position of the VL diagnosis within each list, ranked from first to fifth, was analyzed.

2.5. Statistical analysis

The responses were entered into Excel spreadsheets (Microsoft Corporation, Redmond, WA, USA, Release 12.0.6662, 2012) and exported to the Statistical Package for the Social Sciences for Windows (SPSS Inc., Chicago, Illinois, USA, Release 16.0.2, 2008) for statistical analysis. Descriptive statistical analysis was performed on categorical variables, which were presented as absolute and relative frequencies. The accuracy of ChatGPT/GPT-4 as an AI-assisted diagnostic tool for VL was calculated using the prevalence ratio, and its uncertainty was estimated using a 95% confidence interval (95% CI). All statistical tests were two-tailed, and statistical significance was set at P < 0.05.

3. Results

VL, the correct diagnosis, appeared among the five differential diagnoses generated by ChatGPT (GPT 4.0, OpenAI OpCo, LLC) in six cases, representing 75% of the total number of cases (95% CI: 40.1 – 93.7%). Table 2 shows the five differential diagnoses presented by ChatGPT/GPT-4 for each clinical vignette.

Although ChatGPT/GPT-4 did not list VL as a diagnostic possibility for the clinical vignettes of cases 03 and 04, it did report VL as the top diagnosis for four cases (50.0%; 95% CI: 30.3 – 86.5%). Figure 1 shows the accuracy of ChatGPT/GPT-4 in presenting VL as a differential diagnosis (Figure 1A) and as the principal diagnosis (Figure 1B).

4. Discussion

The ability of ChatGPT to provide diagnostic support, especially in resource-limited settings where access to specialized medical expertise is limited, is one of its most promising contributions to healthcare. By providing reliable differential diagnoses, ChatGPT has the potential to bridge gaps in medical expertise, enabling more timely and accurate clinical decision-making in underserved areas.

This exploratory study evaluated the diagnostic accuracy of ChatGPT/GPT-4 in generating differential diagnosis lists for clinical vignettes of VL. The results showed that ChatGPT/GPT-4 correctly included VL in the top five differential diagnoses in 75% of cases. Notably, ChatGPT/GPT-4 identified VL as the top diagnosis in 50% of these cases. These results indicate that ChatGPT (GPT 4.0, OpenAI OpCo, LLC) has a high potential to aid in the diagnosis of VL, as evidenced by its significant accuracy in generating relevant differential diagnoses.

The findings of our study are consistent with a growing body of research demonstrating the diagnostic capabilities of AI chatbots. For example, Hirosawa et al.²¹ evaluated the diagnostic accuracy of differential diagnosis lists generated by ChatGPT/GPT-3.5 on January 5, 2023, for clinical vignettes with common chief complaints. Their results showed that the correct diagnosis was included within the top ten differential diagnoses in 93.3% of cases. Similarly, a study by Mizuta et al.²² showed that ChatGPT/GPT-4 had a high level of agreement (95.9%) with physicians in determining whether the correct diagnosis was included in the top ten differential diagnosis lists.
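
As an illustration of the binary scoring described in Section 2.4 and the interval estimate described in Section 2.5, the following is a minimal Python sketch, not the authors' SPSS procedure. The article does not state which confidence interval method was used; the Agresti-Coull (adjusted Wald) interval is an assumption here, chosen because it happens to reproduce the reported 40.1 – 93.7% for six of eight cases. The function names `score_vignette` and `agresti_coull_interval` are hypothetical, and the example diagnosis list is invented for illustration and is not ChatGPT output from the study.

```python
import math


def score_vignette(differentials, target="visceral leishmaniasis"):
    """Binary score and rank of the target diagnosis within the top five.

    Returns (1, rank) if the target appears in the first five entries,
    otherwise (0, None). Matching is a simple case-insensitive substring
    test, which is an assumption of this sketch.
    """
    for rank, diagnosis in enumerate(differentials[:5], start=1):
        if target in diagnosis.lower():
            return 1, rank
    return 0, None


def agresti_coull_interval(successes, n, z=1.959964):
    """Approximate 95% CI for a proportion (Agresti-Coull, assumed method)."""
    n_adj = n + z ** 2                      # adjusted sample size
    p_adj = (successes + z ** 2 / 2) / n_adj  # adjusted proportion
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical differential list, for illustration only.
example = ["Malaria", "Visceral leishmaniasis", "Lymphoma",
           "Tuberculosis", "Schistosomiasis"]
print(score_vignette(example))  # (1, 2)

# Counts taken from the Results: VL listed in 6 of 8 vignettes.
low, high = agresti_coull_interval(6, 8)
print(f"{6 / 8:.1%} (95% CI: {low:.1%} to {high:.1%})")  # about 40.1% to 93.7%
```

The same function can be applied to the four-of-eight count for VL as the top diagnosis. Because the interval method is an assumption, its output for that count need not agree with the values printed in the article.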

Volume 1 Issue 4 (2024) 101 doi: 10.36922/aih.3930
