Design+ ML for predicting Alzheimer’s progression
from Scikit-learn,⁹ were employed for modeling and evaluation purposes.

4.2. Data understanding
In the second phase of the methodology, we began by familiarizing ourselves with the collected data using a comprehensive data dictionary, which outlined feature descriptions and properties. The dataset comprised eight distinct CSV files, imported into Google Colab via file synchronization from Google Drive using the PyDrive library in Python.¹⁰ Subsequently, these CSV files were merged to construct a master dataframe, facilitated by shared key columns such as RID, SITEID, and VISCODE, resulting in a unified dataframe containing 1,688 rows and 36 columns.

Given our focus on baseline data, we filtered the dataset for baseline entries using the VISCODE column, yielding 862 observations. To prepare for pre-analysis, we systematically transformed several features into categorical formats based on predefined values. Medical history variables (including MHPSYCH, MH2NEURL, MH4CARD, MH6HEPAT, MH8MUSCL, MH9ENDO, MH10GAST, MH12RENA, MH16SMOK, and MH17MALI) were categorized as “No” or “Yes” based on their respective binary values. Apolipoprotein E (ApoE) genotypes (e.g., APGEN1, APGEN2) were labeled as “E2,” “E3,” or “E4,” corresponding to their genetic variants. MMSCORE was segmented into severity levels (e.g., “Severe,” “Moderate,” “Mild,” “Normal”) based on predefined score ranges. PTGENDER was categorized as “Male” or “Female” according to gender data. CDGLOBAL was classified into health status categories (e.g., “Healthy,” “Very Mild,” “Mild,” “Moderate,” “Severe”) based on clinical assessment scores. DXCURREN was mapped to clinical stages (e.g., “HC,” “MCI,” “AD”) using a predefined mapping dictionary. These transformations enhance the interpretability of the dataset by aligning feature values with clinically relevant categories for subsequent analysis. During this preparatory phase, it was observed that 2.28% of the data were missing; however, no duplicates were detected.

Before the exploratory data analysis, a few data cleaning procedures were performed to enhance the interpretability of the findings. This step was essential to ensure the accuracy and reliability of the analyses by removing any inconsistencies and inaccuracies within the dataset.

Initially, the age of patients was calculated by comparing examination dates with their respective birthdates. This process involved cleansing the date of birth column to remove unnecessary characters, followed by the creation of the EXAMYEAR column to compute the age as a distinct feature. Furthermore, noisy values of “−4”, which recurred across multiple columns, were identified and replaced with NaN. Concurrently, redundant columns such as RID, SITEID, VISCODE, EXAMDATE, EXAMYEAR, APTESTDT, and PTDOB, among others, were eliminated to streamline the dataset for analysis.

In the exploratory data analysis, the distribution of output classes was visualized, revealing a significant class imbalance among HC, MCI, and AD. Specifically, HC emerged as the predominant class with 609 instances, followed by MCI with 144 instances and AD with 105 instances. A subsequent review of summary statistics for numerical features revealed slight discrepancies in feature counts, suggesting the presence of missing values. Moreover, notable differences in scales and variances were observed across many features.

Upon delving further into the distributions of numerical features, distinctive patterns were observed. Variables such as AXT117, BAT126, and HMT7, alongside RCT6 and RCT11, displayed a notable tendency toward higher values, suggesting a right-skewed distribution. Similarly, RCT392 exhibited a comparable pattern, indicating a concentration of data at the lower end with potential outliers extending toward higher values.

In contrast, the distributions of HMT13, HMT40, HMT100, HMT102, RCT20, RCT392, and AGE showed a unimodal pattern, indicative of relatively normal distributions with a pronounced peak at the center. This characteristic suggests the presence of a central value around which the data clusters. Furthermore, LIMMTOTAL displayed a unimodal distribution with an additional smaller peak, while LDELTOTAL exhibited a similar pattern with a slightly less distinct secondary peak.

The analysis was extended using box plots to assess the spread of numerical variables. Except for LIMMTOTAL, LDELTOTAL, and AGE, potential outliers were observed in the remaining variables at both ends of the distribution.

To assess multicollinearity,¹¹ a correlation matrix was constructed and visualized using a heatmap (Figure 2). LIMMTOTAL and LDELTOTAL exhibited a strong positive correlation, indicating a close relationship between these variables. Additionally, HMT3 and HMT40, HMT100 and HMT102, as well as RCT6 and RCT392, demonstrated strong positive correlations, further highlighting interdependencies within the dataset. Conversely, strong negative correlations were observed between HMT100 and HMT3, HMT40 and HMT3, as well as HMT13 and HMT3, suggesting inverse relationships between these variables.

Finally, the association between categorical variables and the target variables was evaluated. As shown in Figure 3,
Volume 2 Issue 3 (2025) 4 doi: 10.36922/DP025270031
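As a rough illustration of the preparation steps described in Section 4.2 (baseline filtering, categorical mapping, replacement of the “−4” sentinel with NaN, and removal of identifier columns), the following pandas sketch may help. It is not the authors' code: the toy dataframe, the baseline code "bl", the numeric encodings in DX_LABELS and YES_NO, and the helper name prepare_baseline are all assumptions made for the example, and only a few of the columns named in the text are included.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged master dataframe; every value here is invented.
master = pd.DataFrame({
    "RID": [1, 1, 2, 3, 4],
    "VISCODE": ["bl", "m06", "bl", "bl", "bl"],
    "DXCURREN": [1, 1, 2, 3, 1],
    "MHPSYCH": [0, 0, 1, 0, -4],
    "HMT40": [13.9, 14.1, -4.0, 15.2, 12.8],
})

# Assumed encodings (hypothetical; the real ones come from the data dictionary).
BASELINE_CODE = "bl"
DX_LABELS = {1: "HC", 2: "MCI", 3: "AD"}
YES_NO = {0: "No", 1: "Yes"}


def prepare_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Filter baseline visits, label categorical codes, and clean sentinels."""
    out = df[df["VISCODE"] == BASELINE_CODE].copy()
    # Codes absent from a mapping (including the -4 sentinel) become NaN.
    out["DXCURREN"] = out["DXCURREN"].map(DX_LABELS)
    out["MHPSYCH"] = out["MHPSYCH"].map(YES_NO)
    # Replace the remaining -4 sentinel values in numeric columns with NaN.
    out = out.replace(-4, np.nan)
    # Drop identifier columns that carry no predictive information.
    return out.drop(columns=["RID", "VISCODE"])


baseline = prepare_baseline(master)

# Share of missing cells (in percent), duplicate rows, and class distribution.
missing_pct = 100 * baseline.isna().to_numpy().mean()
n_duplicates = int(baseline.duplicated().sum())
class_counts = baseline["DXCURREN"].value_counts()
```

On the real data, the same three summaries would correspond to the reported missing-value percentage, the duplicate check, and the HC/MCI/AD class counts. Mapping the codes before replacing the sentinel mirrors the order in which the text describes the steps; the reverse order would work equally well in this sketch.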

