
Global Translational Medicine                          CNNs for overfitting and generalizability in fracture detection




Table 3. Analysis of validation practices in recent fracture detection studies relative to the Food and Drug Administration’s guidance

Algorithm    Accuracy (%)    Data splitting    Learning curves                   External data testing    Year    Reference
MobileNet    99              Two-way           Yes, show convergence issuesᵃ     No                       2025    31
FracNet      100             Three-way         No                                No                       2025    32
Various      64–92           Two-way           No                                No                       2023    33
Canny        90              Not available     No                                No                       2025    34
SimCLR       94              Two-way           Yes, show convergence issuesᵇ     No                       2024    35

Note: Analysis of validation practices in recent fracture detection studies relative to the Food and Drug Administration’s guidance,³⁰ contrasting reporting gaps in training dynamics and external generalizability. Methodologies frequently omit evidence of training-validation convergence and external performance benchmarking. Many report inflated accuracy metrics using internal validation via two-way splitting, which obscures patient overlap risks and prevents cross-study comparability of clinical utility. Notably absent are learning curves demonstrating harmonized training dynamics or stress-testing against distribution shifts, undermining confidence in real-world reliability. While some studies include learning curves, these often reveal erratic validation trajectories indicative of improper regularization or dataset leakage. ᵃ refers to inverted learning curves, and ᵇ refers to identical convergence trajectories.
Abbreviation: SimCLR: A simple framework for contrastive learning of visual representations.
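
The patient overlap risk mentioned in the note to Table 3 can be illustrated with a short sketch. It is not the procedure used in any of the cited studies; it assumes that each image carries a patient identifier (a hypothetical `patient_ids` list) and shows one way a three-way split could keep all images from one patient in a single subset. The function name `patient_level_split` and the default 70/15/15 fractions are illustrative assumptions.

```python
import random
from collections import defaultdict


def patient_level_split(patient_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split image indices into train/val/test with no patient in two subsets.

    patient_ids: one (hypothetical) patient identifier per image.
    fractions:   share of patients, not images, sent to train, val, and test.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Group image indices by patient so a patient's images always move together.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle the patients (not the images) reproducibly.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n = len(patients)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Expand each patient group back into its image indices.
    return {
        name: sorted(i for p in pats for i in by_patient[p])
        for name, pats in groups.items()
    }


if __name__ == "__main__":
    # Toy example: 20 patients with 3 radiographs each.
    ids = [f"p{i // 3}" for i in range(60)]
    split = patient_level_split(ids)
    print({name: len(indices) for name, indices in split.items()})
```

Splitting at the patient level rather than the image level means the held-out accuracy is not inflated by near-duplicate views of the same fracture. A split of this kind is still internal validation and does not replace testing on external data.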
input.³⁵ To mitigate these effects, we advocate multicenter validation frameworks that: (i) Prospectively harmonize imaging protocols across sites using Digital Imaging and Communications in Medicine metadata standardization, (ii) Implement continuous test-time adaptation via adversarial domain-invariant training,³⁶ and (iii) Adopt federated learning architectures to pool heterogeneous data while preserving institutional privacy.³⁷

Our results demonstrate that conventional single-center holdout validation, even with rigorous k-fold splits, overestimates real-world performance by up to 11.3% (external vs. best-case validation accuracy). This aligns with the emerging consensus that external validation should precede clinical implementation, supplemented by stress-testing against rare but critical edge cases (e.g., pediatric buckle fractures).

4.5. Limitations and future directions

While the model demonstrated robustness across two independent datasets, three inherent limitations merit clarification. First, this study did not curate or harmonize patient ages, as neither source dataset included demographic metadata. This reflects real-world clinical deployments where AI tools process images without comprehensive patient histories, prioritizing fracture morphology over population characteristics. Second, the underrepresentation of pathological fractures (those arising from underlying disease processes like metastatic cancer or osteoporosis) versus traumatic fractures (mechanical injuries in structurally normal bone) poses a distinct challenge. Third, despite multi-dataset integration, sample scarcity persists for pediatric (<18 years) and older (>65 years) populations, an unavoidable constraint given their smaller population proportions (~30% collectively). Naturally, the fact that under-18 and over-65 age groups collectively make up about 30% of the population (based on United States census data³⁸) inevitably leads to a dataset composition that reflects this distribution.³⁹ Future work should explicitly recruit these cohorts to assess performance across developmental and degenerative bone phenotypes. Hence, key limitations include:
(i) Dataset diversity gaps: Underrepresentation of pediatric/geriatric populations and pathological fractures. However, clinical applicability depends on generalizability across institutions, not demographic alignment. Including age-specific tuning could paradoxically reduce robustness by overfitting to non-generalizable population features.
(ii) Label noise: Retrospective ground truth from clinical reports inherits inter-observer variability, with up to a 14% discordance rate (the proportion of cases where annotators disagreed on fracture presence) in subtle fracture annotation.¹
(iii) Operational fragility: Performance degrades when faced with non-standard views (e.g., oblique projections) not included in training.

To address these, future work should (i) Develop synthetic data augmentation pipelines tuned to rare fracture phenotypes using diffusion models,⁴⁰ (ii) Implement triple-annotation protocols with orthopedist adjudication to minimize label noise, and (iii) Integrate attention mechanisms focusing on cortical discontinuity (interruption of bone cortex) and periosteal reactions (bone healing responses), which are morphological hallmarks less sensitive to imaging artifacts.

In addition, the “black box” nature of CNNs and deep NNs limits interpretability, which may hinder clinical trust and impede their seamless integration into diagnostic workflows.³⁶ Finally, the retrospective design of this study

            Volume 4 Issue 3 (2025)                         92                              doi: 10.36922/gtm.8526