
Global Translational Medicine                          CNNs for overfitting and generalizability in fracture detection



generalization capacity. The Adam optimizer’s adaptive learning rates mitigated gradient instability during early training phases, while the conservative initial learning rate ensured fine-grained parameter updates critical for distinguishing subtle fracture phenotypes. Restricting training to 10 epochs prevented over-optimization to transient batch-level noise, as evidenced by stabilized validation loss trajectories. By applying dropout exclusively during training and disabling it during validation, the reported metrics accurately reflected the model’s inherent diagnostic capability rather than transient regularization effects.

2.4. k-fold cross-validation
To assess the model’s generalizability, a k-fold cross-validation strategy was employed with k = 5. The dataset was split into k partitions; in each iteration, the model was trained on k − 1 folds and validated on the remaining fold. The process was repeated for all folds, and the cross-validation accuracy was computed as Equation IV:

CV accuracy = (1/k) Σ_{i=1}^{k} accuracy_i                (IV)

where accuracy_i is the validation accuracy for fold i.

The k-fold approach probed model stability under variations in data composition. By cyclically excluding distinct patient subgroups during training, the method simulated multicenter validation scenarios and quantified the performance variance attributable to sampling biases. Repeated retraining across folds ensured that architectural decisions generalized beyond the feature distributions of individual splits. This process mirrored clinical reality, where AI tools must maintain diagnostic fidelity across heterogeneous patient populations and acquisition protocols.

2.5. Algorithm pseudo-code
The training procedure, including cross-validation, is summarized in Algorithm 1.

Algorithm 1. Training and cross-validation procedure
1:  Input: Dataset D, number of folds k, number of epochs E, mini-batch size B
2:  Split D into k folds
3:  for i = 1 to k do
4:      Assign the i-th fold as the validation set D_val and the remaining folds as the training set D_train
5:      Initialize CNN model parameters
6:      for epoch = 1 to E do
7:          Divide D_train into mini-batches of size B
8:          for each mini-batch (X, y) do
9:              Perform forward pass to compute predictions ŷ
10:             Compute loss L (Eq. 4)
11:             Backpropagate gradients and update parameters using Adam
12:         end for
13:         Evaluate model on D_val
14:     end for
15:     Compute validation accuracy for fold i
16: end for
17: Compute cross-validation accuracy (Equation IV)
18: Output: Trained model, cross-validation accuracy

2.6. Evaluation and testing
The final model was evaluated on the test dataset and an external dataset to assess generalizability. Performance metrics, including accuracy, sensitivity, specificity, and confusion matrices, were computed. The accuracy was calculated using a formula that accounts for the binary classification nature of the problem. Specifically, accuracy was defined as the ratio of correctly classified samples to the total number of samples, incorporating true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs). This is expressed mathematically as Equation V:

Accuracy = (TP + TN) / (TP + TN + FP + FN)                (V)

In this context, TP represents the number of fractured cases correctly identified as fractured, while TN denotes the number of non-fractured cases correctly identified as non-fractured. FP corresponds to non-fractured cases incorrectly classified as fractured, and FN represents fractured cases incorrectly classified as non-fractured. This formulation provides a comprehensive measure of the model’s performance by considering all possible outcomes in the classification process.

To further evaluate the model’s performance, additional metrics were computed. Precision, which measures the proportion of correctly identified positive cases out of all predicted positive cases, was calculated using Equation VI:

Precision = TP / (TP + FP)                (VI)

Recall, also referred to as sensitivity or the TP rate, was used to assess the model’s ability to identify all actual positive cases. It is defined as Equation VII:

Recall = TP / (TP + FN)                (VII)

To balance precision and recall, the F1-score was computed as the harmonic mean of these two metrics, providing a single measure that accounts for both FPs and FNs. The F1-score is given by Equation VIII:
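The train-versus-validation dropout behavior described above can be illustrated with a minimal pure-Python sketch of inverted dropout. The function, dropout rate, and activation values are illustrative assumptions, not the paper's implementation; the point is only that units are zeroed and rescaled during training while validation sees the activations unchanged:

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout (illustrative). Active only in training; identity at
    validation/test time. Survivors are scaled by 1/(1 - p) so the expected
    activation is unchanged and no rescaling is needed at evaluation."""
    if not training:
        return list(activations)          # dropout disabled: pass through
    rng = rng or random.Random(0)         # fixed seed for reproducibility
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [0.8, 0.3, 0.5, 0.9]
train_out = dropout(acts, p=0.5, training=True)   # some units zeroed, rest doubled
eval_out = dropout(acts, p=0.5, training=False)   # identical to the input
```

Because evaluation bypasses the stochastic masking entirely, validation metrics reflect the full network's behavior, which is the property the text emphasizes.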

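The fold-splitting and accuracy-averaging logic of Algorithm 1 and Equation IV can be sketched as a runnable skeleton. The CNN, mini-batch loop, and Adam updates (lines 5–14 of the algorithm) are stood in for by a trivial majority-class classifier; `train_majority`, `eval_majority`, and the toy dataset are hypothetical placeholders, so only the cross-validation structure is faithful to the algorithm:

```python
def k_fold_cross_validation(dataset, k, train_fn, eval_fn):
    """Mirror of Algorithm 1: split into k folds, train on k-1 folds,
    validate on the held-out fold, and average fold accuracies (Eq. IV)."""
    folds = [dataset[i::k] for i in range(k)]          # line 2: k folds
    fold_accuracies = []
    for i in range(k):                                 # line 3
        d_val = folds[i]                               # line 4
        d_train = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_fn(d_train)                      # lines 5-14 (stand-in)
        fold_accuracies.append(eval_fn(model, d_val))  # line 15
    return sum(fold_accuracies) / k                    # line 17: Eq. IV

# Hypothetical stand-in for the CNN: predict the majority training label.
def train_majority(train_set):
    labels = [y for _, y in train_set]
    return max(set(labels), key=labels.count)

def eval_majority(model, val_set):
    return sum(1 for _, y in val_set if y == model) / len(val_set)

# Toy imbalanced dataset of (sample, label) pairs: 60 negatives, 40 positives.
data = [(x, 0) for x in range(60)] + [(x, 1) for x in range(60, 100)]
cv_acc = k_fold_cross_validation(data, k=5,
                                 train_fn=train_majority,
                                 eval_fn=eval_majority)
```

Swapping the stand-in for a real training routine changes only `train_fn`/`eval_fn`; the per-fold exclusion and averaging, which is what probes stability under varying data composition, stays the same.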

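Equations V–VII, together with the harmonic-mean F1-score, can be collected into one small helper. The confusion-matrix counts below are illustrative only, not results from the study:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    with 'fractured' as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. V
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # Eq. VI
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # Eq. VII
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)             # harmonic mean (Eq. VIII)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts for a 200-image test set (not the paper's results):
m = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

The guards against zero denominators matter in practice: a fold containing no predicted or no actual positives would otherwise raise a division error rather than report a degenerate score.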
            Volume 4 Issue 3 (2025)                         87                              doi: 10.36922/gtm.8526