influential features for accurate predictions. By combining the permutation feature importance technique¹⁶ with an RF classifier over 150 iterations, this analysis revealed the significance of specific features in influencing predictive accuracy, thereby guiding further modeling decisions. The importance of each feature was systematically assessed to ensure a comprehensive understanding of its contribution to the overall predictive capability of the model.
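To make this step concrete, the following sketch applies scikit-learn's permutation_importance with n_repeats=150 to an RF classifier, one plausible reading of the 150-iteration procedure described above; the synthetic dataset and feature names are stand-ins for the actual ADNI variables.

# Sketch of the permutation-importance analysis; synthetic data
# stands in for the prepared ADNI features (illustrative only).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_classes=3, random_state=42)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(6)])

rf = RandomForestClassifier(random_state=42).fit(X, y)

# Shuffle each feature 150 times and measure the mean drop in accuracy;
# larger drops mark features the model relies on most.
result = permutation_importance(rf, X, y, n_repeats=150,
                                scoring="accuracy", random_state=42)
for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.4f}")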
Subsequently, feature selection was performed to streamline computational resources and optimize model performance. Using an RF feature selection technique¹⁷ with 100 estimators and a maximum depth of 5, the algorithm evaluated each feature's contribution to impurity reduction (Gini impurity) before decision tree construction, thereby identifying the most significant features for predictive modeling. By selecting the most informative, non-redundant features, data utilization was optimized, resulting in improved computational efficiency and enhanced model performance.
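A minimal sketch of this selection step follows, again on synthetic stand-in data; wrapping the 100-tree, depth-5 forest in scikit-learn's SelectFromModel is one plausible realization of the technique, since the forest's feature_importances_ reflect mean Gini-impurity reduction.

# Sketch of the RF-based feature selection step (stand-in data).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_classes=3, random_state=42)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(10)])

# 100 shallow trees (max depth 5); features whose mean Gini-impurity
# reduction exceeds the default threshold are retained.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
).fit(X, y)

print("Retained features:", list(X.columns[selector.get_support()]))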
Addressing class imbalance, a common challenge in ML, was essential to ensuring model robustness across all classes. The Synthetic Minority Over-Sampling Technique (SMOTE)¹⁸ was applied to oversample minority classes (e.g., MCI and AD) by generating synthetic samples, yielding a balanced representation across all classes. Following resampling, further adjustments were made to facilitate model training and evaluation, resulting in the creation of two distinct dataframes for analysis.
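The resampling step could look as follows, assuming the imbalanced-learn package; the class proportions below are invented for illustration and only loosely mirror the CN/MCI/AD imbalance.

# Sketch of the SMOTE resampling step (requires imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Deliberately imbalanced synthetic classes (proportions are invented).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=42)
print("Before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between
# existing nearest neighbors, balancing all classes.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After:", Counter(y_res))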
4.4. Modeling

In selecting ML models during the data preparation phase, non-parametric models were prioritized due to their flexibility in handling complex datasets.¹⁹ Outliers, multicollinearity, and skewness were identified as key challenges that were unaddressed in the previous phase. Therefore, tree-based models were considered suitable due to their adaptability to such issues. For multiclass classification, the RF and XGBoost algorithms were selected.²⁰,²¹

RF²¹ is an ensemble learning algorithm that combines multiple decision trees to yield more accurate and reliable predictions. By training each decision tree on a randomly selected subset of the training data, RF reduces overfitting and enhances model generalizability.

Extreme Gradient Boosting, commonly known as XGBoost,²² is another powerful algorithm in the gradient boosting family. XGBoost is a widely used open-source software library that implements a gradient boosting algorithm. It is commonly applied to ML tasks such as classification, regression, and ranking, particularly when dealing with tabular or structured data. XGBoost is known for its speed, efficiency, and ability to handle large datasets. It sequentially builds a strong predictive model by aggregating the predictions of multiple weak decision trees. Through advanced feature selection and regularization techniques, XGBoost minimizes overfitting and improves model performance.
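To make the two algorithms concrete, the following sketch fits both classifiers on synthetic stand-in data; it assumes the xgboost package, and the parameters shown are defaults or illustrative choices rather than the settings used in this study.

# Sketch contrasting the two algorithms: bagging of independent trees
# (RF) versus sequential boosting of weak trees (XGBoost).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package

X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
xgb = XGBClassifier(objective="multi:softprob", eval_metric="mlogloss",
                    random_state=42).fit(X_train, y_train)

print("RF accuracy:", rf.score(X_test, y_test))
print("XGBoost accuracy:", xgb.score(X_test, y_test))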
Two sets of models were developed for the prepared data: baseline models and their fine-tuned equivalents. For fine-tuning, the “RandomizedSearchCV” function was used.²³ This method selects random combinations of hyperparameter values from a grid, trains the model on a subset of the training data, and evaluates its performance on a different subset using cross-validation. The combination that yields the best performance metric represents the optimized set of hyperparameters for the model.
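The fine-tuning step might be wired up as below; the parameter grid, n_iter, and cv values are illustrative assumptions, not the settings used in the study.

# Sketch of hyperparameter fine-tuning with RandomizedSearchCV.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

# Illustrative search space; each parameter is sampled at random.
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 15),
    "min_samples_split": randint(2, 10),
}

# Samples 20 random combinations and scores each with 5-fold
# cross-validation; best_params_ holds the winning combination.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions, n_iter=20, cv=5,
    scoring="accuracy", random_state=42,
).fit(X, y)

print(search.best_params_, search.best_score_)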
In addition, three distinct diagnosis classifiers were developed to identify the most reliable method for reducing the number of tests required for disease detection, thereby lowering diagnostic costs. These classifiers utilize medical history variables, blood analysis and ApoE genotype variables, and neuropsychological assessment variables individually. To model these classifiers, we employed the fine-tuned version of the best-performing algorithm, ensuring optimal predictive performance. This approach aims to streamline the diagnostic process while maintaining diagnostic accuracy.

4.5. Model evaluation

In this phase, a comprehensive evaluation of the models was conducted to guide future actions. Predictions from all models were compared against actual values using the “classification_report” function.²⁴ The evaluation included accuracy, as well as weighted-average precision, recall, and F1-score, offering a detailed overview of overall model performance. This approach accounts for class imbalances, ensuring robustness across all classes.²⁵ In addition, macro-average and class-wise performance metrics were emphasized when further insights were required. Given the research focus of this study, the deployment phase was omitted. A detailed analysis of the models is presented in the Results and Discussion section.
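A minimal sketch of this evaluation follows; the CN/MCI/AD target names are illustrative stand-ins for the study's diagnosis classes.

# Sketch of the evaluation step; classification_report prints per-class
# precision/recall/F1 plus macro- and weighted-average summaries.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# target_names are illustrative stand-ins for the diagnosis classes.
print(classification_report(y_test, y_pred,
                            target_names=["CN", "MCI", "AD"]))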
5. Results

5.1. Feature importance

The Chi-square test results revealed a significant association between the target variable and two key features, CDGLOBAL and MMSCORE, as indicated by their strong chi-square statistics and extremely low p-values. This finding was further confirmed by the permutation feature importance test. Figure 4 shows the feature importance ranking, highlighting CDGLOBAL as the most influential feature, followed by LDELTOTAL, MMSCORE, and

