biophysical properties (i.e., size, deformability, density). Label-free CTC isolation technologies should overcome the main limitations of marker-based strategies, such as the need for prior knowledge of the precise protein composition on the surfaces of CTCs and the absence of universal markers capable of identifying all heterogeneous CTCs in the bloodstream.

The growing application of ML in clinical practice emphasizes the need for research centers and hospitals to utilize instruments that generate high-quality, properly formatted, and easily accessible CTC imaging data. Expanding the CTC image scope with a multi-omic cell map is a key focus for the future, but several obstacles are hampering the effective integration of ML into clinical practice.

Among the most common pitfalls of current ML approaches to liquid biopsy data analysis, or clinical data analysis in general, are issues with sample size, data bias and imbalance, preprocessing, feature selection, overfitting, generalizability, and algorithm portability. Large amounts of reference data are required to build robust classifiers. As a consequence, most studies have been performed in the context of common cancer types. This highlights the need for concerted profiling efforts and data sharing in the context of rare cancers to fill the gap. Often, the amount of data collected in a single institution is insufficient to unravel the complexity of a problem. Federated learning (FL)111 may help address this issue. FL is a paradigm that seeks to address the limited number of samples available in single research centers and the problem of data governance and patient privacy by collaboratively training algorithms across multiple decentralized edge devices, pooling data from numerous sources without exchanging the data itself, after addressing data heterogeneity, data quality and consistency, and interoperability issues. Typically, patients receive treatment within their local vicinity. Implementing FL on a worldwide level has the potential to improve inclusion and guarantee equal opportunities and excellent clinical judgments for patients, irrespective of the treatment site. This would be advantageous for patients with rare medical conditions, for whom the likelihood of less severe outcomes increases with quicker and more precise diagnoses, as well as for patients needing medical care in distant and underserved regions, as they could access the same top-tier ML-assisted diagnoses available in hospitals with extensive caseloads.
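
To make the FL idea above concrete, the sketch below runs a few rounds of federated averaging (FedAvg) over three simulated "centers": each center fits a simple logistic model on its own private data, and only the learned weights, never the samples, are pooled. The data, model, and center setup are purely illustrative assumptions, not a description of any system cited here.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(X, y, w, lr=0.1, epochs=50):
    """Plain gradient descent on the logistic loss, run locally at one center."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)   # gradient step on local data only
    return w

# Simulated private cohorts for three hypothetical centers; the raw samples
# never leave the center where they were generated.
centers = [(rng.normal(size=(80, 5)), rng.integers(0, 2, 80)) for _ in range(3)]

global_w = np.zeros(5)
for _ in range(10):  # communication rounds
    local_weights, sizes = [], []
    for X, y in centers:
        local_weights.append(local_update(X, y, global_w.copy()))
        sizes.append(len(y))
    # FedAvg: average local models, weighting each center by its sample count
    global_w = np.average(local_weights, axis=0, weights=sizes)

print("aggregated model weights:", np.round(global_w, 3))
```

In a real deployment, the averaging step would run on a coordinating server and only model parameters would cross institutional boundaries, which is what allows data governance and privacy constraints to be respected.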

Handling imbalanced data is also critical. When one class in a classification problem has significantly fewer samples than the other (e.g., samples collected from healthy controls vs. patients), preprocessing techniques such as oversampling, undersampling, or generating synthetic samples can help balance the class distribution, leading to better model performance.
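
As a minimal sketch of the oversampling option mentioned above, the snippet below duplicates minority-class samples with scikit-learn's resample utility on a made-up dataset; synthetic-sample generators such as SMOTE (from the imbalanced-learn package) would slot into the same place. All array shapes and class counts are illustrative assumptions.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: 10 "healthy control" samples vs. 90 "patient" samples
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = np.array([0] * 10 + [1] * 90)   # class 0 is the minority

X_min, X_maj = X[y == 0], X[y == 1]

# Oversample the minority class with replacement until the class sizes match
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([1] * len(X_maj) + [0] * len(X_min_up))

print("class counts after oversampling:", np.bincount(y_bal))
```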

The significance of data preprocessing (data cleaning, filling missing values, removing redundant information, and detecting outliers) in ML cannot be overstated. It serves as a critical initial step that greatly influences the effectiveness and reliability of the entire modeling process. Proper data preprocessing can help prevent overfitting by reducing noise and irrelevant information in the data, enabling the model to generalize better to unseen data. Batch effects are another concern when applying models to data from different sources (e.g., data processed with different instruments and laboratory protocols), as they can obscure biologically important features in favor of technical, systematic, and therefore irrelevant ones.
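
The following sketch illustrates one plausible preprocessing pass along these lines: median imputation of missing values, feature standardization, and a crude per-batch mean-centering as a stand-in for dedicated batch-correction methods such as ComBat. The feature table and batch labels are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table with missing values and two acquisition batches
df = pd.DataFrame({
    "feat1": [1.0, np.nan, 2.5, 3.1, 2.9, np.nan],
    "feat2": [0.2, 0.4, 0.1, 5.2, 5.8, 5.5],
    "batch": ["A", "A", "A", "B", "B", "B"],
})

# Impute missing values, then standardize features
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X = prep.fit_transform(df[["feat1", "feat2"]])

# Crude batch adjustment: remove each batch's mean shift
# (real studies would use dedicated tools such as ComBat)
X_adj = pd.DataFrame(X, columns=["feat1", "feat2"])
X_adj = X_adj.groupby(df["batch"].values).transform(lambda col: col - col.mean())

print(X_adj.round(2))
```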

At present, most AI models employed in healthcare fall short when subjected to reproducibility assessments.112 This discrepancy arises from multiple factors, encompassing the scarcity of publicly accessible medical datasets for model training, validation, and testing; the absence of standardized data collection protocols; overfitting; and data leakage, where training and testing data overlap, leading to an inflated, overly optimistic perception of model performance. Enhanced algorithmic transparency that adheres to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) in AI holds the potential to expedite innovation and significantly amplify the practicality of ML in healthcare.113 This entails making both the code and data available in a meticulous and error-free manner.
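
One frequent source of such leakage is placing repeated samples from the same patient in both the training and the test split. A simple guard, sketched below on synthetic data with scikit-learn's GroupShuffleSplit, is to split at the patient level; the patient identifiers here are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic example: 3 blood draws per patient for 20 patients
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
y = rng.integers(0, 2, 60)
patient_id = np.repeat(np.arange(20), 3)

# Split at the patient level so no patient appears in both sets,
# avoiding the train/test overlap that inflates performance estimates
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

overlap = set(patient_id[train_idx]) & set(patient_id[test_idx])
print("patients shared between train and test:", overlap)   # expected: empty set
```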

Finally, ML model explainability is particularly important in healthcare, where any decision carries significant risk. Black-box, complicated models do not always outperform traditional, more interpretable ones.114 Simplifying AI models, when possible, to enhance understanding by clinicians would improve trust in the method.
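
As a small illustration of favoring interpretable baselines, the sketch below fits a logistic regression on simulated tabular data and checks, via scikit-learn's permutation importance, which features actually drive its predictions; the dataset merely stands in for a real liquid-biopsy feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular liquid-biopsy feature matrix
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An interpretable baseline: coefficients map directly to feature effects
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("coefficients:", np.round(clf.coef_.ravel(), 2))

# Model-agnostic check: how much does shuffling each feature hurt accuracy?
imp = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
for i, score in enumerate(imp.importances_mean):
    print(f"feature {i}: importance {score:.3f}")
```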

8. Digital twins

From the perspective of personalized medicine, the aim is to incorporate all relevant information from a patient’s healthcare record into the multi-omic cell map, creating what is known as a patient’s “digital twin” (Figure 2). By developing complex mathematical models tailored to the specific patient’s journey, the molecular features of the patient’s tumor tissues, and data from biological fluids, these digital twins could mimic disease dynamics, potentially contributing to diagnostic improvements and more effective tumor screening.

At present, computational representations of real-world objects, systems, and processes, known as digital twins, are being developed across various branches of healthcare.115
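
As a deliberately simplified sketch of the "mimic disease dynamics" idea, the code below fits a logistic growth curve to hypothetical longitudinal tumor-burden measurements from a single patient and projects the trajectory forward; a real digital twin would integrate far richer multi-omic, imaging, and clinical-record data. All numbers are invented.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_growth(t, K, r, v0):
    """Tumor burden v(t) with carrying capacity K, growth rate r, initial size v0."""
    return K / (1.0 + (K / v0 - 1.0) * np.exp(-r * t))

# Illustrative measurements (weeks, arbitrary burden units) -- not real data
t_obs = np.array([0, 4, 8, 12, 16, 20], dtype=float)
v_obs = np.array([1.0, 1.8, 3.1, 4.9, 6.8, 8.0])

# Calibrate the toy "twin" to this patient's observed trajectory
params, _ = curve_fit(logistic_growth, t_obs, v_obs, p0=[10.0, 0.2, 1.0])
K, r, v0 = params

# Project the fitted trajectory forward to anticipate disease dynamics
t_future = np.arange(0, 41, 4, dtype=float)
print("fitted K, r, v0:", np.round(params, 2))
print("projected burden:", np.round(logistic_growth(t_future, K, r, v0), 2))
```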
