Global Translational Medicine CNNs for overfitting and generalizability in fracture detection
Table 3. Analysis of validation practices in recent fracture detection studies relative to the Food and Drug Administration’s guidance

| Algorithm | Accuracy (%) | Data splitting | Learning curves | External data testing | Year | Reference |
|---|---|---|---|---|---|---|
| MobileNet | 99 | Two-way | Yes, show convergence issues ᵃ | No | 2025 | 31 |
| FracNet | 100 | Three-way | No | No | 2025 | 32 |
| Various | 64–92 | Two-way | No | No | 2023 | 33 |
| Canny | 90 | Not available | No | No | 2025 | 34 |
| SimCLR | 94 | Two-way | Yes, show convergence issues ᵇ | No | 2024 | 35 |
Note: Analysis of validation practices in recent fracture detection studies relative to the Food and Drug Administration’s guidance,³⁰ contrasting reporting gaps in training dynamics and external generalizability. Methodologies frequently omit evidence of training-validation convergence and external performance benchmarking. Many report inflated accuracy metrics using internal validation via two-way splitting, which obscures patient overlap risks and prevents cross-study comparability of clinical utility. Notably absent are learning curves demonstrating harmonized training dynamics or stress-testing against distribution shifts, undermining confidence in real-world reliability. While some studies include learning curves, these often reveal erratic validation trajectories indicative of improper regularization or dataset leakage. ᵃ refers to inverted learning curves, and ᵇ refers to identical convergence trajectories.
Abbreviation: SimCLR: A simple framework for contrastive learning of visual representations.
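The patient-overlap risk raised in the note to Table 3 can be illustrated with a minimal sketch of a patient-level three-way split, in which all images from one patient are assigned to the same subset. The function names, the 70/15/15 ratio, and the toy records are assumptions made for illustration; none of them is taken from the reviewed studies or from the present work.

```python
import random
from collections import defaultdict


def patient_level_split(records, train=0.70, val=0.15, seed=0):
    """Split (patient_id, image_id) pairs into train/val/test by patient.

    Splitting on patients rather than on images keeps every image of a
    given patient in a single subset, so no patient can leak across subsets.
    """
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)           # deterministic starting order
    random.Random(seed).shuffle(patients)   # reproducible shuffle

    n_train = round(len(patients) * train)
    n_val = round(len(patients) * val)
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {
        name: [(p, img) for p in ids for img in by_patient[p]]
        for name, ids in groups.items()
    }


def patient_overlap(split_a, split_b):
    """Return the set of patient IDs that appear in both subsets."""
    return {p for p, _ in split_a} & {p for p, _ in split_b}


# Hypothetical toy data: 20 patients with 3 radiographs each.
records = [(f"P{p:02d}", f"P{p:02d}_img{i}") for p in range(20) for i in range(3)]
splits = patient_level_split(records)

for a, b in [("train", "val"), ("train", "test"), ("val", "test")]:
    print(a, b, "shared patients:", len(patient_overlap(splits[a], splits[b])))
```

A random image-level split of the same records would typically place radiographs of one patient in both training and validation subsets, which is the leakage pattern the note associates with inflated internal accuracy.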
input.³⁵ To mitigate these effects, we advocate multicenter validation frameworks that: (i) Prospectively harmonize imaging protocols across sites using Digital Imaging and Communications in Medicine metadata standardization, (ii) Implement continuous test-time adaptation via adversarial domain-invariant training,³⁶ and (iii) Adopt federated learning architectures to pool heterogeneous data while preserving institutional privacy.³⁷

Our results demonstrate that conventional single-center holdout validation, even with rigorous k-fold splits, overestimates real-world performance by up to 11.3% (external vs. best-case validation accuracy). This aligns with the emerging consensus that external validation should precede clinical implementation, supplemented by stress-testing against rare but critical edge cases (e.g., pediatric buckle fractures).

4.5. Limitations and future directions

While the model demonstrated robustness across two independent datasets, three inherent limitations merit clarification. First, this study did not curate or harmonize patient ages, as neither source dataset included demographic metadata. This reflects real-world clinical deployments where AI tools process images without comprehensive patient histories, prioritizing fracture morphology over population characteristics. Second, the underrepresentation of pathological fractures (those arising from underlying disease processes like metastatic cancer or osteoporosis) versus traumatic fractures (mechanical injuries in structurally normal bone) poses a distinct challenge. Third, despite multi-dataset integration, sample scarcity persists for pediatric (<18 years) and older (>65 years) populations, an unavoidable constraint given their smaller population proportions (~30% collectively). Naturally, the fact that under-18 and over-65 age groups collectively make up about 30% of the population (based on United States census data³⁸) inevitably leads to a dataset composition that reflects this distribution.³⁹ Future work should explicitly recruit these cohorts to assess performance across developmental and degenerative bone phenotypes. Hence, key limitations include:
(i) Dataset diversity gaps: Underrepresentation of pediatric/geriatric populations and pathological fractures. However, clinical applicability depends on generalizability across institutions, not demographic alignment. Including age-specific tuning could paradoxically reduce robustness by overfitting to non-generalizable population features.
(ii) Label noise: Retrospective ground truth from clinical reports inherits inter-observer variability, with up to a 14% discordance rate (the proportion of cases where annotators disagreed on fracture presence) in subtle fracture annotation.¹
(iii) Operational fragility: Performance degrades when faced with non-standard views (e.g., oblique projections) not included in training.

To address these, future work should (i) Develop synthetic data augmentation pipelines tuned to rare fracture phenotypes using diffusion models,⁴⁰ (ii) Implement triple-annotation protocols with orthopedist adjudication to minimize label noise, and (iii) Integrate attention mechanisms focusing on cortical discontinuity (interruption of bone cortex) and periosteal reactions (bone healing responses), which are morphological hallmarks less sensitive to imaging artifacts.

In addition, the “black box” nature of CNNs and deep NNs limits interpretability, which may hinder clinical trust and impede their seamless integration into diagnostic workflows.³⁶ Finally, the retrospective design of this study
Volume 4 Issue 3 (2025) 92 doi: 10.36922/gtm.8526

