Page 115 - AIH-1-4
P. 115

Artificial Intelligence in Health                        Complex early diagnosis of MS through machine learning




                                                  Five-fold cross-validaon               Evaluaon



                            Preprocess  Hyperparameters tuning                    Concatenate predicons  metrics



                                                                                           Explain


            Figure 1. Diagram of the process for training and testing six classifiers for the prediction of clinically definite multiple sclerosis (CDMS) from clinically
            isolated syndrome (CIS) data

            performance. After training and evaluating the models, we   the Initial_EDSS and Final_EDSS columns since they
            analyzed each model’s feature importance to determine the   contain values exclusively for the CDMS class and null
            most influential features in the predictions. By examining,   values for the non-CDMS class. This discrepancy can lead
            the features  that were consistently ranked highly across   to overfitting, allowing the model to achieve perfect AUC
            different models, we gained insights into the key factors   without considering other features; thus, we excluded these
            that contribute to the diagnosis. We then utilized the   columns from our study.
            SHAP library to further analyze the interactions between   To address the missing values, we imputed the Initial_
            features. SHAP values help us understand not only the   Symptom  column with the  mode  value, acknowledging
            individual impact of each feature but also how features   that these numbers represent categorical data. In addition,
            interact with each other, providing a deeper understanding   we imputed the schooling column with the median value
            of the data’s underlying patterns. We used software tools   due to its numerical nature. Specifically, there was one
            such as Python’s Pandas and Numpy libraries for data   missing value in each of these columns.
            manipulation and basic statistical computations, and
            libraries such as Scikit-learn and Scipy for model building,   To enhance the interpretability and granularity of our
            evaluation, and statistical testing. In addition, we employed   data, we split some columns into multiple binary columns.
            visualization tools such as Matplotlib and Seaborn to   The Initial_Symptom column, which contains values from
            illustrate data analysis and model performance metrics.  1 to 15, indicating the presence of one or more symptoms,
                                                               was divided into four binary columns: Symptom_Visual,
            2.1. Data                                          Symptom_Sensory, Symptom_Motor, and Symptom_
            We used a public dataset of 273 diagnosed patients with   Others. All conversions had been verified by our expert
            CIS  from 2006  to 2010,  which includes  clinical and   neurologists and can be used during the clinical diagnostic
                                 52
            neuroimaging data from the first CIS episode and a   process. This transformation allowed us to analyze the effect
            10-year follow-up. These patients were monitored over a   of each specific symptom on the prediction independently,
            period to observe whether they developed MS. The dataset   thereby improving our understanding of symptom-specific
            includes a variety of features that are potentially relevant   impacts on the diagnosis. Similarly, we split the “Mono
            for predicting the progression to MS. These features   or Polysymptomatic” column into two binary columns:
            encompass clinical characteristics such as the type of initial   Mono_Symptomatic and Poly_Symptomatic, which
            symptoms, the presence of specific neurological signs, and   differentiates between patients exhibiting a single symptom
            results from diagnostic tests like MRI scans. In addition,   versus  multiple  symptoms.  This  split  provides  a  clearer
            demographic information such as age, gender, and medical   representation of symptom complexity in our dataset.
            history were also included to provide a holistic view of   Moreover, we remapped certain numerical columns
            each patient’s profile. Patients were classified as CDMS or   to a binary format to standardize the data and ensured
            non-CDMS based on the McDonald 2010 criteria.  These   consistency in the model input. Specifically, we remapped
                                                    12
            features are described in Table 1.                 Oligoclonal_Bands, Gender, Breastfeeding, and Varicella
                                                               columns to binary values where 0 indicates a negative
            2.2. Preprocessing                                 response, 1 indicates a positive response, and -1 represents
            Our preprocessing steps included several critical   unknown  values.  This  binary  transformation  simplified
            transformations and imputations to ensure robust model   these  categorical  features  into  a  consistent  format  that
            performance and prevent data leakage. First, we removed   could be easily interpreted by ML algorithms. It also


            Volume 1 Issue 4 (2024)                        109                               doi: 10.36922/aih.4255
   110   111   112   113   114   115   116   117   118   119   120