CDGLOBAL and MMSCORE displayed significant Chi-square statistics12 with extremely low p-values, indicating a robust association with the target variable. This suggests that these variables hold substantial predictive power with respect to the target outcome. In addition, MH2NEURL, APGEN1, and APGEN2 exhibited moderate Chi-square statistics accompanied by small p-values, indicating a noticeable association with the target variable, although not as strong as that of CDGLOBAL and MMSCORE. However, MH8MUSCL and PTGENDER demonstrated relatively smaller Chi-square statistics along with higher p-values, suggesting a weaker association with the target variable.

Figure 2. Correlation matrix heatmap of numerical features

Overall, the exploratory data analysis identified several areas for improvement, including class imbalance, missing values, outliers, multicollinearity, and skewed distributions.

4.3. Data preparation

Data preparation, the third phase of the CRISP-DM methodology, began by following the basic data cleaning steps conducted during the data understanding phase. The initial step involved converting categorical data into a numerical format to facilitate model development. However, it was noted that the "pd.factorize" function13 assigned "−1" in place of missing values, necessitating further replacement with NaN values to enable imputation at a later stage.
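A minimal sketch of this encoding step is given below; the column names and values are illustrative placeholders, not the study's actual code:

import numpy as np
import pandas as pd

# Illustrative frame with categorical columns containing missing entries
df = pd.DataFrame({
    "PTGENDER": ["Male", "Female", None, "Female"],
    "APGEN1": [3.0, 4.0, 3.0, None],
})

for col in ["PTGENDER", "APGEN1"]:
    codes, _ = pd.factorize(df[col])  # pd.factorize encodes missing entries as -1
    df[col] = codes

df = df.replace(-1, np.nan)  # restore missingness so it can be imputed later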
To prevent data leakage and assess the model's efficacy in generalizing to previously unseen data, a critical first step was to split the data before implementing any preprocessing techniques.14 The data was split in an 80:20 ratio, allocating 80% for training and the remaining 20% for testing. Subsequently, we focused on handling missing values within the training dataset, acknowledging the potential impact on predictive accuracy due to data loss if inadequately addressed. To address this, the MissForest imputation technique,15 an algorithmic approach that initially uses mean and mode values to replace missing data, was applied. This was followed by the implementation of an RF methodology to iteratively predict missing values, prioritizing data accuracy over processing speed.
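The split-then-impute order described above can be sketched as follows. The paper does not name a specific MissForest implementation, so the imputation step is approximated here with scikit-learn's IterativeImputer driven by a random-forest estimator (simple initial fill, then iterative RF predictions of missing entries); the data frame is a synthetic stand-in, not the study's dataset:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the prepared feature matrix and target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
X[X > 1.5] = np.nan  # introduce some missing values
y = pd.Series(rng.integers(0, 3, size=200))

# Split before any preprocessing so that test-set information cannot leak
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# MissForest-style imputation: simple initial fill, then iterative
# random-forest predictions of each column's missing entries
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=42),
    initial_strategy="mean",
    max_iter=10,
    random_state=42,
)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # fitted on the training data only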
Given the high dimensionality of the dataset, an analysis of feature importance was conducted to determine the most

Figure 3. Visualization of the results from the Chi-square test
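For reference, the kind of per-feature statistic summarized in Figure 3 can be computed with a contingency-table Chi-square test. The sketch below assumes a DataFrame holding the categorical features named above; "DX" is an assumed name for the target column, not taken from the paper:

import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_summary(df: pd.DataFrame, target: str, features: list[str]) -> pd.DataFrame:
    """Chi-square statistic, p-value, and degrees of freedom for each feature vs. the target."""
    rows = []
    for col in features:
        table = pd.crosstab(df[col], df[target])  # contingency table: feature levels vs. target classes
        chi2, p_value, dof, _ = chi2_contingency(table)
        rows.append({"feature": col, "chi2": chi2, "p_value": p_value, "dof": dof})
    return pd.DataFrame(rows).sort_values("chi2", ascending=False)

# Example call with the variables discussed in the text (target name assumed):
# chi_square_summary(data, "DX", ["CDGLOBAL", "MMSCORE", "MH2NEURL", "APGEN1",
#                                 "APGEN2", "MH8MUSCL", "PTGENDER"])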

