Accordingly, we caution against direct performance comparisons without accounting for differences in evaluation frameworks. Instead, our discussion advocates for harmonized benchmarking standards that prioritize rigorous, clinically representative validation, a prerequisite for bridging the gap between technical achievements and operational reliability in fracture detection.

4.2. Cross-study comparison

The inconsistent reporting of validation practices noted above becomes apparent when analyzing training dynamics in recent studies. For instance, when an ML technique demonstrates inverted learning curves,30 i.e., validation accuracy paradoxically exceeding training accuracy, this is a hallmark of methodological flaws, including insufficient data splits, improper hyperparameter tuning, or unaddressed dataset leakage. These inverted patterns, while superficially suggesting high validation performance (e.g., suspicious 99–100% accuracy reports31), often mask critical failures in generalizability that only manifest in external testing. Such cases exemplify systemic issues in validation protocols: when models are not stress-tested against distribution shifts or required to demonstrate harmonized training/validation convergence, nominal accuracy metrics become dangerously deceptive proxies for clinical utility. Our methodology directly counters these risks through iterative learning curve monitoring, three-way splitting to eliminate patient data overlap, and architectural safeguards (batch normalization, dropout) explicitly designed to force alignment between training and validation trajectories. This rigor is reflected in our model's stable, convergent curves despite more conservative accuracy reporting.
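
To make the patient-level separation concrete, the sketch below shows one way such a leakage-free three-way split can be implemented. It is illustrative rather than a description of our exact pipeline: it assumes an image-level pandas DataFrame with a hypothetical patient_id column identifying which radiographs belong to the same patient.

```python
# Minimal sketch (assumed, not the authors' code): a three-way split that
# keeps every patient's images in exactly one partition, preventing the
# patient-level leakage discussed above.
from sklearn.model_selection import GroupShuffleSplit


def three_way_patient_split(df, train_frac=0.70, val_frac=0.15, seed=42):
    """Split an image-level table into train/val/test with no patient overlap.

    `df` is assumed to carry a hypothetical `patient_id` column; whatever is
    not assigned to training or validation becomes the held-out test set.
    """
    test_frac = 1.0 - train_frac - val_frac

    # Step 1: hold out the test partition at the patient level.
    outer = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    trainval_idx, test_idx = next(outer.split(df, groups=df["patient_id"]))
    trainval, test = df.iloc[trainval_idx], df.iloc[test_idx]

    # Step 2: split the remainder into training and validation, again by patient.
    inner = GroupShuffleSplit(
        n_splits=1, test_size=val_frac / (train_frac + val_frac), random_state=seed
    )
    train_idx, val_idx = next(inner.split(trainval, groups=trainval["patient_id"]))
    train, val = trainval.iloc[train_idx], trainval.iloc[val_idx]

    # Sanity check: the three patient sets must be pairwise disjoint.
    p_train, p_val, p_test = (set(x["patient_id"]) for x in (train, val, test))
    assert p_train.isdisjoint(p_val) and p_train.isdisjoint(p_test) and p_val.isdisjoint(p_test)
    return train, val, test
```

Making the disjointness check explicit, as in the final assertion, is precisely what guards against the non-independent validation sets discussed in the following paragraphs.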

The comparison in Table 3 highlights critical gaps in compliance with FDA-recommended validation protocols, particularly regarding the convergence analysis essential for assessing clinical reliability. While the FDA's guidance emphasizes harmonized training-validation trajectories as evidence of generalizability,29 most studies either omit this analysis entirely or present incomplete evidence.31-33 Among the minority that include learning curves,30,34 many reveal fundamental inconsistencies: the studies marked (a) exhibit inverted validation-training metrics, indicating improper data splits or patient overlap, while those marked (b) show identical convergence trajectories (i.e., no measurable gap between them). In rigorous ML validation, training metrics should show a slight but consistent divergence from validation metrics, a controlled gap indicating that the model is learning without overfitting. When the curves are identical, the validation set is not truly independent; data from the same patients or images may exist in both the training and validation splits. This creates a false impression of perfect generalization, as the model “validates” on data it has effectively memorized during training. This discrepancy demonstrates how insufficient validation reporting, even when nominally including learning curves, permits accuracy inflation through unaddressed data leakage or overfitting while failing to meet regulatory standards for clinical translatability.
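
As a rough illustration of this kind of convergence check, the sketch below classifies the terminal behavior of a pair of learning curves. The gap thresholds are illustrative assumptions, not values drawn from this study or from regulatory guidance.

```python
# Minimal sketch (assumed): flag whether the final epochs of a training run
# show inverted, suspiciously identical, overfitted, or acceptably
# converged learning curves. Thresholds are illustrative only.
import numpy as np


def assess_convergence(train_acc, val_acc, last_k=5, min_gap=0.005, max_gap=0.05):
    """train_acc/val_acc: per-epoch accuracies; averages the last `last_k` epochs."""
    gap = float(np.mean(train_acc[-last_k:]) - np.mean(val_acc[-last_k:]))

    if gap < 0:
        return "inverted"     # validation > training: suspect leakage or a bad split
    if gap < min_gap:
        return "identical"    # no measurable gap: validation set may not be independent
    if gap > max_gap:
        return "overfitting"  # training pulls far ahead of validation
    return "converged"        # small, consistent gap: learning without overfitting


# Example with made-up histories: a small, stable gap returns "converged".
print(assess_convergence([0.81, 0.88, 0.92, 0.94, 0.95],
                         [0.79, 0.85, 0.89, 0.91, 0.92]))
```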

Until standardized validation frameworks (e.g., the FDA's guidance emphasizing training-validation convergence) are universally accepted, high reported accuracies in AI-assisted diagnostic studies remain clinically uninterpretable unless accompanied by a demonstration of proper model training and generalizability, as improper convergence risks inflated metrics that invalidate cross-study comparisons.

4.3. Error analysis and clinical implications

Error patterns revealed asymmetric risks. FNs predominantly occurred in subtle fractures (e.g., hairline fissures, occult fractures), while FPs arose from anatomical mimics such as trabecular patterns or overlapping soft tissues. This reflects clinical realities where radiologists face similar challenges, though AI may amplify uncertainties due to pixel-space decision-making without anatomical context.

The high-recall, low-precision tradeoff prioritizes fracture detection sensitivity but risks overutilization of confirmatory imaging (computed tomography/magnetic resonance imaging). For every 100 external cases, approximately 8.3 FPs would necessitate additional investigations, incurring costs and patient anxiety. Conversely, the 5.8% FN rate (versus 3.2% internally) underscores residual risks of delayed treatment, particularly in weight-bearing bones where missed fractures can lead to catastrophic complications. To balance safety and efficiency, clinical deployment should integrate risk-stratified confidence thresholds: lower thresholds for high-stakes anatomical regions (e.g., femoral neck) to maximize sensitivity, and higher thresholds for peripheral sites to reduce unnecessary imaging.
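
One way such risk-stratified thresholds could be operationalized is sketched below. The region names and cut-off values are hypothetical placeholders chosen only to illustrate the lower-threshold/higher-threshold asymmetry; they are not operating points reported or validated in this study.

```python
# Minimal sketch (assumed): map a model's fracture probability to a clinical
# action using region-specific cut-offs. Lower cut-offs for high-stakes,
# weight-bearing sites favor sensitivity; higher cut-offs for peripheral
# sites limit unnecessary confirmatory imaging. All values are illustrative.
REGION_THRESHOLDS = {
    "femoral_neck": 0.20,   # high-stakes: missed fractures are catastrophic
    "vertebra": 0.25,
    "distal_radius": 0.45,
    "phalanx": 0.55,        # peripheral: favor precision over recall
}
DEFAULT_THRESHOLD = 0.50


def triage(fracture_probability: float, region: str) -> str:
    """Return the recommended action for one radiograph."""
    cutoff = REGION_THRESHOLDS.get(region, DEFAULT_THRESHOLD)
    return "flag_for_confirmatory_workup" if fracture_probability >= cutoff else "routine_read"


# The same model output is handled differently by anatomical region:
# triage(0.30, "femoral_neck") -> "flag_for_confirmatory_workup"
# triage(0.30, "phalanx")      -> "routine_read"
```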

4.4. Domain shift and validation best practices

Domain shift emerged primarily from institutional differences in imaging protocols and population characteristics. For instance, external data included a higher proportion of osteoporosis-related fractures, which present distinct morphological signatures (e.g., compressed versus displaced fractures) compared to trauma-driven cases in the training data. Protocol variations in beam energy and collimation further degraded performance by altering contrast gradients at fracture edges, a critical CNN

