
Design+                                                             ML for predicting Alzheimer’s progression



from Scikit-learn,⁹ were employed for modeling and evaluation purposes.

4.2. Data understanding

In the second phase of the methodology, we began by familiarizing ourselves with the collected data using a comprehensive data dictionary, which outlined feature descriptions and properties. The dataset comprised eight distinct CSV files, imported into Google Colab via file synchronization from Google Drive using the PyDrive library in Python.¹⁰ Subsequently, these CSV files were merged into a master dataframe on shared key columns such as RID, SITEID, and VISCODE, resulting in a unified dataframe containing 1,688 rows and 36 columns.

Given our focus on baseline data, we filtered the dataset for baseline entries using the VISCODE column, yielding 862 observations. To prepare for pre-analysis, we systematically transformed several features into categorical formats based on predefined values. Medical history variables, including MHPSYCH, MH2NEURL, MH4CARD, MH6HEPAT, MH8MUSCL, MH9ENDO, MH10GAST, MH12RENA, MH16SMOK, and MH17MALI, were categorized as “No” or “Yes” based on their respective binary values. Apolipoprotein E (ApoE) genotypes (e.g., APGEN1, APGEN2) were labeled as “E2,” “E3,” or “E4,” corresponding to their genetic variants. MMSCORE was segmented into severity levels (e.g., “Severe,” “Moderate,” “Mild,” “Normal”) based on predefined score ranges. PTGENDER was categorized as “Male” or “Female” according to gender data. CDGLOBAL was classified into health status categories (e.g., “Healthy,” “Very Mild,” “Mild,” “Moderate,” “Severe”) based on clinical assessment scores. DXCURREN was mapped to clinical stages (e.g., “HC,” “MCI,” “AD”) using a predefined mapping dictionary. These transformations enhance the interpretability of the dataset by aligning feature values with clinically relevant categories for subsequent analysis. During this preparatory phase, it was observed that 2.28% of the data were missing; however, no duplicates were detected.

Before the exploratory data analysis, a few data cleaning procedures were performed to enhance the interpretability of the findings. This step was essential to ensure the accuracy and reliability of the analyses by removing any inconsistencies and inaccuracies within the dataset. Initially, the age of patients was calculated by comparing examination dates with their respective birthdates. This process involved cleansing the date of birth column to remove unnecessary characters, followed by the creation of the EXAMYEAR column to compute the age as a distinct feature. Furthermore, noisy values of “−4,” which recurred across multiple columns, were identified and replaced with NaN. Concurrently, redundant columns such as RID, SITEID, VISCODE, EXAMDATE, EXAMYEAR, APTESTDT, and PTDOB, among others, were eliminated to streamline the dataset for analysis.

In the exploratory data analysis, the distribution of output classes was visualized, revealing a significant class imbalance among HC, MCI, and AD. Specifically, HC emerged as the predominant class with 609 instances, followed by MCI with 144 instances and AD with 105 instances. A subsequent review of summary statistics for numerical features revealed slight discrepancies in feature counts, suggesting the presence of missing values. Moreover, notable differences in scales and variances were observed across many features.

Upon delving further into the distributions of numerical features, distinctive patterns were observed. Variables such as AXT117, BAT126, and HMT7, alongside RCT6 and RCT11, displayed tails extending toward higher values, suggesting a right-skewed distribution. Similarly, RCT392 exhibited a comparable pattern, indicating a concentration of data at the lower end with potential outliers extending toward higher values.

In contrast, the distributions of HMT13, HMT40, HMT100, HMT102, RCT20, RCT392, and AGE showed a unimodal pattern, indicative of relatively normal distributions with a pronounced peak at the center. This characteristic suggests the presence of a central value around which the data clusters. Furthermore, LIMMTOTAL displayed a unimodal distribution with an additional smaller peak, while LDELTOTAL exhibited a similar pattern with a slightly less distinct secondary peak.

The analysis was extended using box plots to assess the spread of numerical variables. Except for LIMMTOTAL, LDELTOTAL, and AGE, potential outliers were observed in the remaining variables at both ends of the distribution.

To assess multicollinearity,¹¹ a correlation matrix was constructed and visualized using a heatmap (Figure 2). LIMMTOTAL and LDELTOTAL exhibited a strong positive correlation, indicating a close relationship between these variables. Additionally, HMT3 and HMT40, HMT100 and HMT102, as well as RCT6 and RCT392, demonstrated strong positive correlations, further highlighting interdependencies within the dataset. Conversely, strong negative correlations were observed between HMT100 and HMT3, HMT40 and HMT3, as well as HMT13 and HMT3, suggesting inverse relationships between these variables.

Finally, the association between categorical variables and the target variables was evaluated. As shown in Figure 3,
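
As an illustration of the merge and baseline filter described in Section 4.2, the following minimal sketch joins several tables on the shared keys RID, SITEID, and VISCODE and then keeps only baseline visits. The three toy tables, the helper name `merge_on_keys`, the outer join, and the "bl" baseline code are assumptions made for illustration; the text specifies neither the join type nor how baseline visits are coded.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the separate CSV files (the study used eight).
demographics = pd.DataFrame({
    "RID": [1, 1, 2, 3],
    "SITEID": [10, 10, 11, 12],
    "VISCODE": ["bl", "m06", "bl", "bl"],
    "PTGENDER": [1, 1, 2, 1],
})
cognition = pd.DataFrame({
    "RID": [1, 1, 2],
    "SITEID": [10, 10, 11],
    "VISCODE": ["bl", "m06", "bl"],
    "MMSCORE": [29, 28, 24],
})
diagnosis = pd.DataFrame({
    "RID": [1, 1, 2, 3],
    "SITEID": [10, 10, 11, 12],
    "VISCODE": ["bl", "m06", "bl", "bl"],
    "DXCURREN": [1, 1, 2, 3],
})

KEYS = ["RID", "SITEID", "VISCODE"]


def merge_on_keys(frames, keys=KEYS, how="outer"):
    """Fold a list of dataframes into one by merging on the shared keys.

    The join type is an assumption; an outer join keeps visits that are
    missing from some tables and leaves NaN in the unmatched columns.
    """
    return reduce(lambda left, right: left.merge(right, on=keys, how=how), frames)


master = merge_on_keys([demographics, cognition, diagnosis])

# Assumption: baseline visits carry the code "bl" in VISCODE.
baseline = master[master["VISCODE"] == "bl"].reset_index(drop=True)

print(master.shape, len(baseline))
```

With the real data, the same helper would be applied to the eight imported dataframes, and the row count of `baseline` could be checked against the 862 observations reported above.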

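
The recoding and cleaning steps described in Section 4.2 can be sketched in the same way. The example below maps numeric codes to labels, derives AGE from the examination and birth years, replaces the sentinel value −4 with NaN, and drops identifier columns. The codebooks (`YES_NO`, `GENDER`, `DIAGNOSIS`), the toy values, and the raw PTDOB format are hypothetical; the actual codes come from the study's data dictionary, which is not reproduced here.

```python
import numpy as np
import pandas as pd

# Toy baseline table; values and the PTDOB string format are assumptions.
df = pd.DataFrame({
    "RID": [1, 2, 3, 4],
    "VISCODE": ["bl"] * 4,
    "EXAMDATE": ["2006-03-15", "2006-05-02", "2007-01-20", "2006-11-08"],
    "PTDOB": ["/1931", "/1940", "/1928", "/1945"],
    "MHPSYCH": [0, 1, -4, 0],
    "PTGENDER": [1, 2, 2, 1],
    "DXCURREN": [1, 2, 3, 1],
    "HMT40": [14.2, -4, 13.1, 15.0],
})

# 1. Recode numeric codes into labels (hypothetical codebooks).
YES_NO = {0: "No", 1: "Yes"}
GENDER = {1: "Male", 2: "Female"}
DIAGNOSIS = {1: "HC", 2: "MCI", 3: "AD"}
df["MHPSYCH"] = df["MHPSYCH"].map(YES_NO)  # unmapped codes such as -4 become NaN
df["PTGENDER"] = df["PTGENDER"].map(GENDER)
df["DXCURREN"] = df["DXCURREN"].map(DIAGNOSIS)

# 2. Derive AGE: strip stray characters from PTDOB and subtract from EXAMYEAR.
birth_year = df["PTDOB"].str.extract(r"(\d{4})", expand=False).astype(int)
df["EXAMYEAR"] = pd.to_datetime(df["EXAMDATE"]).dt.year
df["AGE"] = df["EXAMYEAR"] - birth_year

# 3. Treat the sentinel value -4 as missing in the numeric columns.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].replace(-4, np.nan)

# 4. Drop identifier and helper columns that are no longer needed.
REDUNDANT = ["RID", "VISCODE", "EXAMDATE", "EXAMYEAR", "PTDOB"]
clean = df.drop(columns=REDUNDANT)

# Share of missing cells and number of duplicated rows.
missing_pct = clean.isna().to_numpy().mean() * 100
n_duplicates = int(clean.duplicated().sum())
print(f"{missing_pct:.2f}% missing, {n_duplicates} duplicate rows")
```

The last two lines correspond to the kind of check behind the reported 2.28% missing data and the absence of duplicates, although the exact computation used in the study is not stated.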

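
The exploratory checks described above (class balance, skewness, and pairwise correlation) can be sketched on synthetic data as follows. The generated values, the helper `strong_pairs`, and the 0.7 threshold are assumptions for illustration only; they are not taken from the study, which does not state a cutoff for a "strong" correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: two memory scores that share a latent component,
# one right-skewed lab value, and a roughly normal AGE.
memory = rng.normal(10, 3, n)
df = pd.DataFrame({
    "DXCURREN": rng.choice(["HC", "MCI", "AD"], size=n, p=[0.70, 0.17, 0.13]),
    "LIMMTOTAL": memory + rng.normal(0, 0.5, n),
    "LDELTOTAL": memory + rng.normal(0, 0.5, n),
    "RCT392": rng.lognormal(0, 0.8, n),
    "AGE": rng.normal(75, 6, n),
})

# Class balance of the target.
class_counts = df["DXCURREN"].value_counts()

# Skewness per numeric feature (positive values indicate a right tail).
numeric = df.select_dtypes(include="number")
skewness = numeric.skew()

# Pearson correlation matrix, the input to a heatmap.
corr = numeric.corr()


def strong_pairs(corr, threshold=0.7):
    """Return feature pairs whose absolute correlation meets the threshold.

    Only the upper triangle is kept so each pair appears once.
    """
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs.abs() >= threshold]


print(class_counts)
print(skewness.round(2))
print(strong_pairs(corr))
```

Passing `corr` to a plotting routine such as seaborn's `heatmap` would give a view comparable in kind to Figure 2; plotting is left out here to keep the sketch free of extra dependencies.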
            Volume 2 Issue 3 (2025)                         4                            doi: 10.36922/DP025270031