


Figure 6. Trends in performance metrics (accuracy, precision, recall, and F1-score) for the validation, test, and external datasets. The figure emphasizes the model's robustness and high recall, critical for minimizing missed diagnoses, while precision shows a slight decline on external data due to increased false positives.

performance across all metrics. This drop is expected due to differences in data distribution between the external dataset and the training data. The lower precision indicates a higher rate of FPs, which could lead to unnecessary follow-up investigations in clinical settings.
Across all datasets, recall remains consistently high, with the external dataset achieving a value of 94.2%. This suggests the model is effective at identifying fractures, which is critical in minimizing missed diagnoses. Precision, however, decreases more markedly on the external dataset, highlighting the model's tendency to produce more FPs when applied to data from a different distribution. The trends shown in Figure 6 therefore reflect strong model performance on internal datasets, with a predictable decline in metrics when the model is applied to external data. This emphasizes the importance of external validation and the need for careful consideration of FPs in clinical applications.
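As a minimal illustration (not the study's own code), the per-split metrics summarized in Figure 6 can be computed as follows; the label and prediction arrays are assumed placeholders:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def split_metrics(y_true, y_pred):
    # Fracture present is treated as the positive class.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),  # degraded by false positives
        "recall": recall_score(y_true, y_pred),        # degraded by missed fractures
        "f1": f1_score(y_true, y_pred),
    }

# y_val/y_test/y_ext and the *_pred arrays are assumed placeholders.
for name, (y_true, y_pred) in {"validation": (y_val, y_val_pred),
                               "test": (y_test, y_test_pred),
                               "external": (y_ext, y_ext_pred)}.items():
    print(name, split_metrics(y_true, y_pred))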
4. Discussion
The model's diagnostic performance and limitations were analyzed through four critical aspects, assessing how technical achievements in controlled validation translate to clinical utility amidst real-world heterogeneity in imaging protocols, population characteristics, and operational constraints.

4.1. Performance and generalizability
The CNN demonstrated robust diagnostic performance, achieving 95.8% validation accuracy and 94.5% test accuracy, with k-fold cross-validation (k = 5) confirming stability (95% average accuracy). However, the 8.3% decline in external test accuracy (91.7%) underscores contextual challenges in generalizability. This gradient—from validation to external testing—aligns with the expected impact of domain shift in medical AI, where protocol heterogeneity (e.g., X-ray exposure parameters, sensor resolutions) alters low-level image textures critical for CNN feature extraction.28 While the model maintained high recall (94.2% externally), precision declined significantly (Δ = −6.1% relative to test data), reflecting fragility in distinguishing fractures from anatomical mimics (e.g., nutrient canals) under distributional mismatch. These findings mirror broader ML challenges where aggregate metrics mask subtype-specific vulnerabilities.9,18
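A hedged sketch of the protocol described above, in which the test set is held out before 5-fold cross-validation so that neither the test nor the external data influences training; build_cnn, images, and labels are assumed placeholders rather than the study's actual identifiers:

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Hold out an internal test set first (split fraction is an assumption).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    images, labels, test_size=0.15, stratify=labels, random_state=42)

fold_acc = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X_trainval, y_trainval):
    model = build_cnn()  # assumed factory returning a compiled Keras model
    model.fit(X_trainval[train_idx], y_trainval[train_idx],
              validation_data=(X_trainval[val_idx], y_trainval[val_idx]),
              epochs=30, batch_size=32, verbose=0)
    _, acc = model.evaluate(X_trainval[val_idx], y_trainval[val_idx], verbose=0)
    fold_acc.append(acc)

print("mean 5-fold validation accuracy:", np.mean(fold_acc))
# The held-out test set and the external dataset are each evaluated only once,
# after this loop, to avoid information leakage into training decisions.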
Fracture subtypes underrepresented in training data (e.g., greenstick, pathologic) exhibited higher misclassification rates, emphasizing the need for granular performance reporting across morphological categories. The model's reliance on single-institution datasets—despite k-fold validation—limited its capacity to generalize across demographic and regional variations in fracture etiology, echoing concerns in clinical trial generalizability.5 Future studies should stratify results by fracture type and demographic covariates to better quantify real-world applicability.
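The granular, subtype-level reporting recommended here could take roughly the following form; subtype_labels and y_pred are assumed placeholder arrays covering fracture-positive images, not data from the study:

import pandas as pd

df = pd.DataFrame({
    "subtype": subtype_labels,   # e.g., "greenstick", "pathologic", "transverse"
    "detected": y_pred,          # 1 = model flagged the fracture, 0 = missed
})

report = (
    df.groupby("subtype")["detected"]
      .agg(n="count", recall="mean")             # per-subtype sensitivity
      .assign(miss_rate=lambda t: 1 - t.recall)
      .sort_values("recall")
)
print(report)  # underrepresented subtypes would be expected to rank lowest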

While earlier studies in AI-driven fracture detection frequently emphasize high diagnostic accuracy, our results underscore the necessity of evaluating such claims against methodological transparency and validation rigor. Prior research often reports exceptional performance metrics derived from protocols that insufficiently address data leakage or domain heterogeneity—a limitation exemplified by the reliance on single-institution datasets and inconsistent reporting of validation practices. In contrast, our approach prioritized strict separation of training, validation, and testing phases, supplemented by external validation to assess real-world applicability. This framework aligns with recent regulatory guidelines, including the Food and Drug Administration (FDA)'s emphasis on robust validation practices for AI models in clinical settings.29

The observed performance gradient between internal and external evaluations reflects methodological divergences (i.e., systematic differences in validation frameworks, data handling, or evaluation criteria across studies) rather than algorithmic shortcomings. While many existing models derive metrics from optimistically partitioned data, our use of cross-validation and architectural safeguards—such as dropout and batch normalization—reduced overfitting risks and maintained training stability. External validation revealed predictable declines in precision, a pattern consistent with domain shift challenges seen across medical AI. These findings highlight a broader issue: nominal accuracy disparities often signal discrepancies in validation practices rather than true model capabilities.
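As an illustrative sketch (layer sizes and rates are assumptions, not the study's exact architecture), dropout and batch normalization can be combined in a Keras-style CNN as follows:

from tensorflow.keras import layers, models

def build_cnn(input_shape=(224, 224, 1)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),           # stabilizes training across batches
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                   # discourages co-adapted features, limiting overfitting
        layers.Dense(1, activation="sigmoid"), # fracture vs. no fracture
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model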

