Figure 6. Trends in performance metrics (accuracy, precision, recall, and F1-score) for the validation, test, and external datasets. The figure emphasizes the model’s robustness and high recall, critical for minimizing missed diagnoses, while precision shows a slight decline on external data due to increased false positives.

performance across all metrics. This drop is expected due to differences in data distribution between the external dataset and the training data. The lower precision indicates a higher rate of FPs, which could lead to unnecessary follow-up investigations in clinical settings.

Across all datasets, recall remains consistently high, with the external dataset achieving a value of 94.2%. This suggests the model is effective at identifying fractures, which is critical in minimizing missed diagnoses. Precision, however, decreases more significantly in the external dataset, highlighting the model’s tendency to produce more FPs when applied to data from a different distribution. Hence, the trends shown in Figure 6 reflect strong model performance on internal datasets, with a predictable decline in metrics when applied to external data. This emphasizes the importance of external validation and the need for careful consideration of FPs in clinical applications.
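To make this relationship concrete, the brief sketch below illustrates how a rise in FPs lowers precision while leaving recall unchanged; the counts are illustrative placeholders only and do not correspond to the study’s confusion matrices.

```python
# Illustrative only: counts are placeholders, not the study's confusion matrices.
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Internal test set: few false positives, so precision stays high.
print(precision_recall_f1(tp=94, fp=6, fn=6))   # (0.94, 0.94, 0.94)
# External set: same recall, but more FPs (e.g., anatomical mimics flagged as fractures),
# so precision drops while recall is unchanged.
print(precision_recall_f1(tp=94, fp=13, fn=6))  # (~0.88, 0.94, ~0.91)
```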
4. Discussion

The model’s diagnostic performance and limitations were analyzed through four critical aspects, assessing how technical achievements in controlled validation translate to clinical utility amidst real-world heterogeneity in imaging protocols, population characteristics, and operational constraints.

4.1. Performance and generalizability

The CNN demonstrated robust diagnostic performance, achieving 95.8% validation accuracy and 94.5% test accuracy, with k-fold cross-validation (k = 5) confirming stability (95% average accuracy). However, the 8.3% decline in external test accuracy (91.7%) underscores contextual challenges in generalizability. This gradient—from validation to external testing—aligns with the expected impact of domain shift in medical AI, where protocol heterogeneity (e.g., X-ray exposure parameters, sensor resolutions) alters low-level image textures critical for CNN feature extraction.28 While the model maintained high recall (94.2% externally), precision declined significantly (Δ = −6.1% relative to test data), reflecting fragility in distinguishing fractures from anatomical mimics (e.g., nutrient canals) under distributional mismatch. These findings mirror broader ML challenges where aggregate metrics mask subtype-specific vulnerabilities.9,18
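A minimal sketch of such a k-fold validation scheme is shown below; the `build_cnn` constructor and the arrays `X`, `y`, `X_ext`, and `y_ext` are hypothetical placeholders, as the study’s actual pipeline is not reproduced here.

```python
# Sketch of 5-fold cross-validation with a separate external evaluation.
# `build_cnn`, `X`, `y`, `X_ext`, and `y_ext` are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_accuracy(build_cnn, X, y, k=5, seed=42):
    """Train on k-1 folds, validate on the held-out fold, and average accuracy."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    accuracies = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_cnn()                    # fresh model for each fold
        model.fit(X[train_idx], y[train_idx])  # placeholder training call
        preds = model.predict(X[val_idx])      # placeholder inference: predicted labels
        accuracies.append(float(np.mean(preds == y[val_idx])))
    return float(np.mean(accuracies))

# The external dataset (different institution and imaging protocol) is evaluated
# only once, after model selection, to estimate real-world generalizability:
# external_acc = np.mean(final_model.predict(X_ext) == y_ext)
```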
Fracture subtypes underrepresented in training data (e.g., greenstick, pathologic) exhibited higher misclassification rates, emphasizing the need for granular performance reporting across morphological categories. The model’s reliance on single-institution datasets—despite k-fold validation—limited its capacity to generalize across demographic and regional variations in fracture etiology, echoing concerns in clinical trial generalizability.5 Future studies should stratify results by fracture type and demographic covariates to better quantify real-world applicability.

While earlier studies in AI-driven fracture detection frequently emphasize high diagnostic accuracy, our results underscore the necessity of evaluating such claims against methodological transparency and validation rigor. Prior research often reports exceptional performance metrics derived from protocols that insufficiently address data leakage or domain heterogeneity—a limitation exemplified by the reliance on single-institution datasets and inconsistent reporting of validation practices. In contrast, our approach prioritized strict separation of training, validation, and testing phases, supplemented by external validation to assess real-world applicability. This framework aligns with recent regulatory guidelines, including the Food and Drug Administration (FDA)’s emphasis on robust validation practices for AI models in clinical settings.29

The observed performance gradient between internal and external evaluations reflects methodological divergences (i.e., systematic differences in validation frameworks, data handling, or evaluation criteria across studies) rather than algorithmic shortcomings. While many existing models derive metrics from optimistically partitioned data, our use of cross-validation and architectural safeguards—such as dropout and batch normalization—reduced overfitting risks and maintained training stability. External validation revealed predictable declines in precision, a pattern consistent with domain shift challenges seen across medical AI. These findings highlight a broader issue: nominal accuracy disparities often signal discrepancies in validation practices rather than true model capabilities.
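For illustration, the sketch below shows one common way dropout and batch normalization are incorporated into a CNN classifier; the layer sizes, input shape, and optimizer are assumptions for this example and not the study’s exact architecture.

```python
# Illustrative CNN block using batch normalization and dropout as overfitting
# safeguards. All hyperparameters here are assumptions, not the study's values.
from tensorflow.keras import layers, models

def build_cnn(input_shape=(224, 224, 1), dropout_rate=0.5):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),            # stabilizes activations during training
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(dropout_rate),           # randomly zeroes activations to curb overfitting
        layers.Dense(1, activation="sigmoid"),  # binary output: fracture vs. no fracture
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```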

