Design+ ML for predicting Alzheimer’s progression
CDGLOBAL and MMSCORE displayed significant Chi-square statistics with extremely low p-values [12], indicating a robust association with the target variable. This suggests that these variables hold substantial predictive power with respect to the target outcome. In addition, MH2NEURL, APGEN1, and APGEN2 exhibited moderate Chi-square statistics accompanied by small p-values, indicating a noticeable association with the target variable, although not as strong as that of CDGLOBAL and MMSCORE. However, MH8MUSCL and PTGENDER demonstrated relatively smaller Chi-square statistics along with higher p-values, suggesting a weaker association with the target variable.

Overall, the exploratory data analysis identified several areas for improvement, including class imbalance, missing values, outliers, multicollinearity, and skewed distributions.

4.3. Data preparation
Data preparation, the third phase of the CRISP-DM methodology, began by following the basic data cleaning steps conducted during the data understanding phase. The initial step involved converting categorical data into a numerical format to facilitate model development. However, it was noted that the “pd.factorize” function [13] assigned “−1” in place of missing values, necessitating further replacement with NaN values to enable imputation at a later stage.
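The sentinel behavior described above can be illustrated with a minimal sketch. The toy values and the column name `PTGENDER_code` are hypothetical and are not taken from the study's dataset; the sketch only shows how the “−1” code returned by `pd.factorize` for a missing entry could be turned back into NaN so that a later imputation step recognizes it as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry (not the study's data).
df = pd.DataFrame({"PTGENDER": ["Male", "Female", None, "Female"]})

# pd.factorize encodes categories as integers and, by default,
# marks missing values with the sentinel code -1.
codes, uniques = pd.factorize(df["PTGENDER"])

# Replace the sentinel with NaN so the value is treated as missing
# by a later imputation step (the column becomes float as a result).
df["PTGENDER_code"] = np.where(codes == -1, np.nan, codes)
```

Without this replacement, an imputer would treat “−1” as an ordinary category code rather than as a gap to be filled.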
To prevent data leakage and assess the model’s efficacy
in generalizing to previously unseen data, a critical
first step was to split the data before implementing any
preprocessing techniques. The data was split in an 80:20 ratio [14], allocating 80% for training and the remaining 20%
for testing. Subsequently, we focused on handling missing
values within the training dataset, acknowledging the
potential impact on predictive accuracy due to data loss
if inadequately addressed. To address this, we applied the MissForest imputation technique [15], an algorithmic approach that first replaces missing data with mean and mode values and then uses a random forest (RF) to iteratively predict the missing values, prioritizing data accuracy over processing speed.
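The split-then-impute order described above can be sketched as follows. This is not the authors' code: it assumes scikit-learn's `IterativeImputer` with a `RandomForestRegressor` as a MissForest-style stand-in, uses synthetic numeric data with hypothetical column names, and omits the mode-based handling of categorical features that MissForest proper includes. The point it illustrates is that the imputer is fitted on the training portion only and then applied unchanged to the test portion.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

# Synthetic numeric data with roughly 10% missing entries (assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f0", "f1", "f2", "f3"])
X = X.mask(rng.random(X.shape) < 0.1)
y = pd.Series(rng.integers(0, 3, size=100), name="target")

# Split first (80:20) so no test-set information reaches preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# MissForest-style imputation: start from the column mean, then
# iteratively re-predict missing entries with a random forest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    initial_strategy="mean",
    max_iter=5,
    random_state=0,
)

# Fit on the training data only; reuse the fitted imputer on the test data.
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X.columns, index=X_test.index
)
```

Fitting the imputer before the split would let test-set values influence the imputed training values, which is the leakage the split-first order is meant to avoid.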
Figure 2. Correlation matrix heatmap of numerical features
Figure 3. Visualization of the results from the Chi-square test

Given the high dimensionality of the dataset, an analysis of feature importance was conducted to determine the most
Volume 2 Issue 3 (2025) 5 doi: 10.36922/DP025270031

