Global Translational Medicine CNNs for overfitting and generalizability in fracture detection
The application of AI in radiology has seen remarkable progress, with AI-based tools being used to enhance the accuracy and efficiency of diagnosing bone fractures.1 These tools are designed to assist radiologists by providing faster and more consistent fracture identification, which is crucial for timely and effective treatment.1,2 Recent studies have demonstrated the capability of AI algorithms to accurately detect and classify fractures, especially in the wrist and long bones, using X-ray images.3

CNNs have emerged as a cornerstone in medical imaging analysis, particularly in orthopedics, due to their ability to process and analyze complex image data with high accuracy.4 These networks are structured to mimic the human visual cortex,5 allowing them to identify patterns and features in medical images that may be difficult for human observers to discern.6,7 In the context of bone fracture detection, CNNs have shown promising results, with some studies indicating that AI is noninferior to clinicians in terms of diagnostic performance.8

1.1. Common challenges

Despite the advancements in AI-based fracture detection, several challenges persist in the field.9 High-quality, annotated datasets are essential for training effective AI models. However, there is often a scarcity of such datasets, which can limit the performance and generalizability of fracture detection models. AI models, particularly deep learning models, often overfit to the training data, especially when the dataset is small or lacks diversity.10,11 This limits the model’s ability to generalize to new, unseen data.

A persistent challenge in AI-assisted fracture detection lies not in achieving high nominal accuracy but in ensuring that such metrics stem from rigorously validated models capable of real-world generalization.1,12,13 Many studies report exceptional performance, yet methodological shortcomings, such as inadequate data splitting, insufficient validation protocols (systematic procedures for evaluating model performance, including partitioning data into training, validation, and test sets to prevent overfitting), or reliance on homogeneous datasets, often inflate internal benchmarks at the expense of clinical applicability.14-17 This discrepancy highlights a critical disconnect; models optimized for accuracy on internal data may fail catastrophically when confronted with external populations or operational heterogeneity, a limitation amplified by inconsistent validation practices across the field.9,18 The goal of our study is to directly address this gap by prioritizing training stability and generalizability over raw performance through methodological rigor in data handling and model validation.

Proper data splitting is essential to developing models that generalize well to unseen data. This involves partitioning datasets into different subsets, such as training, validation, and test sets.14-19 The graph in Figure 1 illustrates the number of studies published each year using two-way and three-way data splitting strategies from 2007 to 2022. It highlights a significant shift in the research community’s approach to data splitting in ML studies. In the earlier years, particularly from 2007 to 2017, most studies employed two-way splitting, where the dataset is divided into a training set and a testing set. This method lacks a validation set, which is essential for tuning hyperparameters and preventing overfitting. Without a validation set, models may not generalize well to unseen data, leading to inefficient ML training and distorted results. This limits the model’s ability to generalize to new, unseen data. Particular attention must be given to avoid data leakage, where information from the test set inadvertently influences model training, leading to inflated and unreliable performance metrics.

Starting around 2018, the graph shows several studies adopting three-way splitting. This approach involves splitting the data into three sets: training, validation, and testing. The validation set is used during model development to fine-tune hyperparameters and select the best model before final evaluation on the test set. By 2022, the number of studies using three-way splitting surpasses those using two-way splitting, indicating a positive trend toward more robust ML practices.

The increasing adoption of three-way splitting reflects a growing awareness of the pitfalls of overfitting and the importance of model validation. Without a validation set, there is a risk of inadvertently tuning the model to perform well on the test set, which can lead to overly optimistic performance estimates and poor generalization.17-20 When

Figure 1. Yearly trend in the number of studies employing two-way and three-way data splitting strategies in artificial intelligence-assisted bone fracture detection research (2007–2022). The graph highlights the increasing adoption of three-way splitting, reflecting improved validation practices and model generalizability in machine learning. Data derived from Jung et al.9
Volume 4 Issue 3 (2025) 84 doi: 10.36922/gtm.8526

