Global Translational Medicine Computational advances in cancer liquid biopsy
biophysical properties (i.e., size, deformability, density). Label-free CTC isolation technologies should overcome the main limitations of marker-based strategies, such as the need for prior knowledge of the precise protein composition on the surfaces of CTCs and the absence of universal markers capable of identifying all heterogeneous CTCs in the bloodstream.

The growing application of ML in clinical practice emphasizes the need for research centers and hospitals to utilize instruments that generate high-quality, properly formatted, and easily accessible CTC imaging data. Expanding the CTC image scope with a multi-omic cell map is a key focus for the future, but several obstacles are hampering the effective integration of ML into clinical practice.

Among the most common pitfalls of current ML approaches to liquid biopsy data analysis, or clinical data analysis in general, are: proper sample size, data bias and imbalance, preprocessing, feature selection, overfitting, generalizability, and algorithm portability. Large amounts of reference data are required to build robust classifiers. As a consequence, most studies have been performed in the context of common cancer types. This highlights the need for concerted profiling efforts and data sharing in the context of rare cancers to fill the gap. Often, the amount of data collected in a single institution is insufficient to unravel the complexity of a problem. Federated learning (FL) may help address this issue.¹¹¹ FL is a paradigm that seeks to address the limited number of samples available in single research centers and the problem of data governance and patient privacy by collaboratively training algorithms across multiple decentralized edge devices, pooling data from numerous sources without exchanging the data itself, after addressing data heterogeneity, data quality and consistency, and interoperability issues. Typically, patients receive treatment within their local vicinity. Implementing FL on a worldwide level has the potential to improve inclusion and guarantee equal opportunities and excellent clinical judgments for patients, irrespective of the treatment site. This would be advantageous for patients with rare medical conditions, for whom the likelihood of less severe outcomes increases with quicker and more precise diagnoses, as well as for patients needing medical care in distant and underserved regions, as they could access the same top-tier ML-assisted diagnoses available in hospitals with extensive caseloads.

Handling imbalanced data is also critical. When one class in a classification problem has significantly fewer samples than the other (e.g., samples collected from healthy controls vs. patients), preprocessing techniques such as oversampling, undersampling, or generating synthetic samples can help balance the class distribution, leading to better model performance. The significance of data preprocessing (data cleaning, filling missing values, removing redundant information, and detecting outliers) in ML cannot be overstated. It serves as a critical initial step that greatly influences the effectiveness and reliability of the entire modeling process. Proper data preprocessing can help prevent overfitting by reducing noise and irrelevant information in the data, enabling the model to generalize better to unseen data. Batch effects may be another potential concern when applying models to data from different sources (e.g., processed with different instruments and laboratory protocols), as they can obscure biologically important features in the data in favor of technical, systematic, and, therefore, irrelevant ones.

At present, most AI models employed in healthcare fall short when subjected to reproducibility assessments.¹¹² This discrepancy arises from multiple factors, encompassing the scarcity of publicly accessible medical datasets for model training, validation, and testing, the absence of standardized data collection protocols, overfitting, and data leakage, where training and testing data overlap, leading to an inflated, overly optimistic perception of model performance. Enhanced algorithmic transparency that adheres to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) in AI holds the potential to expedite innovation and significantly amplify the practicality of ML in healthcare.¹¹³ This entails making both the code and data available in a meticulous and error-free manner.

Finally, ML model explainability is particularly important in healthcare, where any decision carries significant risk. Black-box, complicated models do not always outperform traditional, more interpretable ones.¹¹⁴ Simplifying AI models, when possible, to enhance understanding by clinicians would improve trust in the method.

8. Digital twins

From the perspective of personalized medicine, the aim is to incorporate all relevant information from a patient’s healthcare record into the multi-omic cell map, creating what is known as a patient’s “digital twin” (Figure 2). By developing complex mathematical models tailored to the specific patient’s journey, the molecular features of the patient’s tumor tissues, and biological fluids data, these digital twins could mimic disease dynamics, potentially contributing to diagnostic improvements and more effective tumor screening.

At present, computational representations of real-world objects, systems, and processes, known as digital twins, are being developed across various branches of healthcare.¹¹⁵
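
As a brief aside, the class-imbalance and data-leakage pitfalls raised in the preceding discussion of ML pitfalls can be made concrete with a minimal Python sketch. It is an illustration only, not a method taken from the cited studies: the cohort is a hypothetical toy label vector (40 controls, 8 patients), and the helper names `stratified_split` and `random_oversample` are assumptions introduced here. The point it shows is one of ordering: the held-out test set is split off first, and only the training portion is oversampled, so duplicated minority samples cannot end up on both sides of the split.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices into train/test, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Keep at least one sample of every class in the test set.
        n_test = max(1, round(len(shuffled) * test_fraction))
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


def random_oversample(indices, labels, seed=0):
    """Resample minority-class indices (with replacement) up to the majority size."""
    rng = random.Random(seed)
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical toy cohort: 0 = healthy control, 1 = patient.
labels = [0] * 40 + [1] * 8

# Split FIRST, then oversample the training indices only.
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=42)
balanced_train_idx = random_oversample(train_idx, labels, seed=42)

train_counts = Counter(labels[i] for i in balanced_train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts), "test:", dict(test_counts))
```

In this toy run the training set ends up with equal class counts, while the test set keeps the original class ratio and shares no samples with the training set. Oversampling before the split would instead allow copies of the same patient sample to appear in both sets, which is one simple way the leakage described above can arise.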
Volume 3 Issue 3 (2024) 8 doi: 10.36922/gtm.3063

