Accordingly, we caution against direct performance comparisons without accounting for differences in evaluation frameworks. Instead, our discussion advocates for harmonized benchmarking standards that prioritize rigorous, clinically representative validation—a prerequisite for bridging the gap between technical achievements and operational reliability in fracture detection.

4.2. Cross-study comparison

The above-mentioned inconsistent reporting of validation practices can be seen when analyzing training dynamics in recent studies. For instance, when ML techniques demonstrate inverted learning curves,30 i.e., the validation accuracy paradoxically exceeds training accuracy, it is a hallmark of methodological flaws, including insufficient data splits, improper hyperparameter tuning, or unaddressed dataset leakage. These inverted patterns, while superficially suggesting high validation performance (e.g., suspicious 99–100% accuracy reports31), often mask critical failures in generalizability that only manifest in external testing. Such cases exemplify systemic issues in validation protocols; when models are not stress-tested against distribution shifts or required to demonstrate harmonized training/validation convergence, nominal accuracy metrics become dangerously deceptive proxies for clinical utility. Our methodology directly counters these risks through iterative learning curve monitoring, three-way splitting to eliminate patient data overlap, and architectural safeguards (batch normalization, dropout) explicitly designed to force alignment between training and validation trajectories; a rigor reflected in our model's stable, convergent curves despite more conservative accuracy reporting.
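As a concrete illustration of such patient-level three-way splitting, the following minimal sketch (assuming scikit-learn and a per-image patient_id field, which are not part of this study's reported pipeline) prevents images from the same patient from appearing in more than one split:

```python
# Minimal sketch of a patient-level three-way split (train/validation/test),
# assuming each radiograph carries a patient_id. Grouping by patient prevents
# images from the same patient from leaking across splits.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(image_ids, patient_ids, val_frac=0.15, test_frac=0.15, seed=42):
    image_ids = np.asarray(image_ids)
    patient_ids = np.asarray(patient_ids)

    # First hold out a test set, splitting by patient rather than by image.
    outer = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    trainval_idx, test_idx = next(outer.split(image_ids, groups=patient_ids))

    # Then split the remainder into training and validation, again by patient.
    inner = GroupShuffleSplit(
        n_splits=1, test_size=val_frac / (1.0 - test_frac), random_state=seed
    )
    train_idx, val_idx = next(
        inner.split(image_ids[trainval_idx], groups=patient_ids[trainval_idx])
    )

    splits = {
        "train": image_ids[trainval_idx][train_idx],
        "val": image_ids[trainval_idx][val_idx],
        "test": image_ids[test_idx],
    }
    # Sanity check: no patient may appear in more than one split.
    seen = [set(patient_ids[np.isin(image_ids, ids)]) for ids in splits.values()]
    assert not (seen[0] & seen[1]) and not (seen[0] & seen[2]) and not (seen[1] & seen[2])
    return splits
```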
The comparison in Table 3 highlights critical gaps in compliance with FDA-recommended validation protocols, particularly regarding convergence analysis essential for assessing clinical reliability. While the FDA's guidance emphasizes harmonized training-validation trajectories as evidence of generalizability,29 most studies either omit this analysis entirely or present incomplete evidence.31-33 Among the minority that include learning curves,30,34 many reveal fundamental inconsistencies; the studies marked (a) exhibit inverted validation-training metrics indicating improper data splits or patient overlap, while (b) annotations show identical convergence trajectories (i.e., no measurable gap between them). In rigorous ML validation, training metrics should show a slight but consistent divergence from validation metrics; a controlled gap indicating the model is learning without overfitting. When curves are identical, it indicates that the validation set is not truly independent; data from the same patients or images may exist in both training and validation splits. This creates a false impression of perfect generalization, as the model "validates" on data it has effectively memorized during training. This discrepancy demonstrates how insufficient validation reporting, even when nominally including learning curves, permits accuracy inflation through unaddressed data leakage or overfitting while failing to meet regulatory standards for clinical translatability.
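These learning-curve pathologies can also be screened for programmatically. The sketch below assumes per-epoch accuracy histories (e.g., from a Keras-style training loop); the tolerance values are illustrative assumptions rather than thresholds used in this study:

```python
# Sketch of a learning-curve sanity check, assuming per-epoch training and
# validation accuracies. The flags mirror the pathologies discussed above;
# the tolerances are illustrative only.
import numpy as np

def audit_learning_curves(train_acc, val_acc, tail=5, identical_tol=1e-3, max_gap=0.05):
    train_tail = np.mean(train_acc[-tail:])  # average over the last few epochs
    val_tail = np.mean(val_acc[-tail:])
    gap = train_tail - val_tail

    if gap < -identical_tol:
        return "inverted: validation exceeds training accuracy (possible leakage or bad split)"
    if abs(gap) <= identical_tol:
        return "identical: no measurable train/validation gap (validation set may not be independent)"
    if gap > max_gap:
        return "diverging: large train/validation gap (overfitting)"
    return "converged: small, stable gap consistent with learning without overfitting"
```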
Until standardized validation frameworks (e.g., the FDA's guidance emphasizing training-validation convergence) are universally accepted, high reported accuracies in AI-assisted diagnostic studies remain clinically uninterpretable unless accompanied by a demonstration of proper model training and generalizability, as improper convergence risks inflated metrics that invalidate cross-study comparisons.

4.3. Error analysis and clinical implications

Error patterns revealed asymmetric risks. FNs predominantly occurred in subtle fractures (e.g., hairline fissures, occult fractures), while FPs arose from anatomical mimics such as trabecular patterns or overlapping soft tissues. This reflects clinical realities where radiologists face similar challenges, though AI may amplify uncertainties due to pixel-space decision-making without anatomical context.

The high recall-low precision tradeoff prioritizes fracture detection sensitivity but risks overutilization of confirmatory imaging (computed tomography/magnetic resonance imaging). For every 100 external cases, approximately 8.3 FPs would necessitate additional investigations, incurring costs and patient anxiety. Conversely, the 5.8% FN rate (versus 3.2% internally) underscores residual risks of delayed treatment, particularly in weight-bearing bones where missed fractures can lead to catastrophic complications. To balance safety and efficiency, clinical deployment should integrate risk-stratified confidence thresholds—lower thresholds for high-stakes anatomical regions (e.g., femoral neck) to maximize sensitivity, and higher thresholds for peripheral sites to reduce unnecessary imaging.
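The following sketch illustrates how such risk-stratified thresholds and the per-100-case error arithmetic could be implemented; the region names and threshold values are illustrative assumptions, not parameters reported here:

```python
# Illustrative sketch of risk-stratified decision thresholds and the
# per-100-case error arithmetic discussed above. The regions and thresholds
# are assumptions for demonstration, not values from this study.
RISK_THRESHOLDS = {
    "femoral_neck": 0.30,    # high-stakes region: lower threshold, favor sensitivity
    "vertebra": 0.35,
    "distal_phalanx": 0.60,  # peripheral site: higher threshold, fewer work-ups
    "default": 0.50,
}

def classify_fracture(probability: float, region: str) -> bool:
    """Flag a suspected fracture using a region-specific confidence threshold."""
    return probability >= RISK_THRESHOLDS.get(region, RISK_THRESHOLDS["default"])

def expected_errors_per_100(fp_rate: float, fn_rate: float) -> tuple[float, float]:
    """Expected FPs and FNs per 100 cases, treating the rates as per-case fractions."""
    return 100 * fp_rate, 100 * fn_rate

# Example: an 8.3% FP fraction and 5.8% FN fraction on external data imply roughly
# 8.3 unnecessary work-ups and 5.8 missed fractures per 100 cases screened.
print(expected_errors_per_100(0.083, 0.058))  # approximately (8.3, 5.8)
```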
4.4. Domain shift and validation best practices

Domain shift emerged primarily from institutional differences in imaging protocols and population characteristics. For instance, external data included a higher proportion of osteoporosis-related fractures, which present distinct morphological signatures (e.g., compressed versus displaced fractures) compared to trauma-driven cases in training data. Protocol variations in beam energy and collimation further degraded performance by altering contrast gradients at fracture edges, a critical CNN