influential features for accurate predictions. By combining the permutation feature importance technique16 with an RF classifier over 150 iterations, this analysis revealed the significance of specific features in influencing predictive accuracy, thereby guiding further modeling decisions. The importance of each feature was systematically assessed to ensure a comprehensive understanding of its contribution to the overall predictive capability of the model.
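As an illustration, the following minimal Python sketch shows how such an analysis can be set up with scikit-learn; the dataset, feature names, and train/validation split are hypothetical stand-ins, and the 150 iterations described above are mapped to the n_repeats parameter of permutation_importance.

    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    # Hypothetical stand-in data; the study's actual features are not used here.
    X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=42)
    X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(10)])
    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

    rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    # n_repeats=150 mirrors the 150 permutation iterations described above.
    result = permutation_importance(rf, X_val, y_val, n_repeats=150, random_state=42)

    # Rank features by the mean drop in score when their values are shuffled.
    ranking = pd.Series(result.importances_mean, index=X.columns)
    print(ranking.sort_values(ascending=False))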
Subsequently, feature selection was performed to streamline computational resources and optimize model performance. Using an RF feature selection technique17 with 100 estimators and a maximum depth of 5, the algorithm evaluated each feature’s contribution to impurity reduction (Gini impurity) during decision tree construction, thereby identifying the most significant features for predictive modeling. By selecting the most informative, non-redundant features, data utilization was optimized, resulting in improved computational efficiency and enhanced model performance.
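A minimal sketch of this step, assuming scikit-learn's SelectFromModel as the selection mechanism (the text names the estimator settings but not the exact selection routine or importance cutoff):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # Hypothetical stand-in data.
    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

    # RF with 100 estimators and max depth 5, as described above; feature
    # importances are derived from Gini impurity reduction across the trees.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
        threshold="median",  # assumed cutoff; the text does not state one
    )
    X_selected = selector.fit_transform(X, y)
    print(X_selected.shape)  # only features above the importance cutoff remain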
Addressing class imbalance, a common challenge in ML, was essential to ensuring model robustness across all classes. The Synthetic Minority Over-Sampling Technique (SMOTE)18 was applied to oversample minority classes (e.g., MCI and AD) by generating synthetic samples, yielding a balanced representation across all classes. Following resampling, further adjustments were made to facilitate model training and evaluation, resulting in the creation of two distinct dataframes for analysis.
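The resampling step might look like the following sketch, using the SMOTE implementation from the imbalanced-learn library; the three-class toy data and class proportions are hypothetical.

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Hypothetical imbalanced three-class data (e.g., CN/MCI/AD encoded as 0/1/2).
    X, y = make_classification(
        n_samples=600, n_classes=3, n_informative=5,
        weights=[0.7, 0.2, 0.1], random_state=0,
    )
    print("before:", Counter(y))

    # SMOTE synthesizes new minority-class samples by interpolating between
    # each minority sample and its nearest neighbors.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print("after:", Counter(y_res))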
4.4. Modeling
In selecting ML models during the data preparation phase, non-parametric models19 were prioritized due to their flexibility in handling complex datasets. Outliers, multicollinearity, and skewness were identified as key challenges that remained unaddressed in the previous phase. Therefore, tree-based models were considered suitable due to their adaptability to such issues. For multiclass classification, the RF and XGBoost algorithms were selected.20,21
RF is an ensemble learning algorithm that combines multiple decision trees to yield more accurate and reliable predictions. By training each decision tree on randomly selected subsets of the training data, RF reduces overfitting and enhances model generalizability.
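A minimal baseline RF classifier in scikit-learn, on hypothetical data, showing the bootstrap sampling that gives each tree its own random subset of the training data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical three-class data standing in for the prepared dataset.
    X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # bootstrap=True (the default) trains each tree on a random sample
    # of the training data, which is what curbs overfitting.
    rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
    rf.fit(X_train, y_train)
    print("held-out accuracy:", rf.score(X_test, y_test))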
Extreme Gradient Boosting,22 commonly known as XGBoost, is another powerful algorithm in the gradient boosting family. It is a widely used open-source software library that implements gradient boosting and is commonly applied to ML tasks such as classification, regression, and ranking, particularly when dealing with tabular or structured data. XGBoost is known for its speed, efficiency, and ability to handle large datasets. It sequentially builds a strong predictive model by aggregating the predictions of multiple weak decision trees. Through advanced feature selection and regularization techniques, XGBoost minimizes overfitting and improves model performance.
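An equivalent baseline sketch with the xgboost library's scikit-learn wrapper; the hyperparameter values shown are illustrative defaults, not the study's settings:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Hypothetical three-class data standing in for the prepared dataset.
    X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Boosting adds trees sequentially; shrinkage (learning_rate) and
    # regularization (reg_lambda) help minimize overfitting.
    xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, reg_lambda=1.0, random_state=0)
    xgb.fit(X_train, y_train)
    print("held-out accuracy:", xgb.score(X_test, y_test))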
Two sets of models were developed for the prepared data: baseline models and their fine-tuned equivalents. For fine-tuning, the “RandomizedSearchCV” function was used.23 This method selects random combinations of hyperparameter values from a grid, trains the model on a subset of the training data, and evaluates its performance on a different subset using cross-validation. The combination that yields the best performance metric represents the optimized set of hyperparameters for the model.
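The tuning procedure could be sketched as follows with scikit-learn's RandomizedSearchCV; the search space, number of sampled combinations, and scoring metric are illustrative, as the text does not list them:

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Hypothetical data; the grid below is an assumed search space.
    X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)

    param_dist = {
        "n_estimators": randint(100, 500),
        "max_depth": randint(3, 15),
        "min_samples_split": randint(2, 10),
    }

    # Each of the n_iter draws is a random hyperparameter combination,
    # evaluated by cross-validation; the best one is kept.
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions=param_dist,
        n_iter=20, cv=5, scoring="f1_weighted", random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)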
In addition, three distinct diagnosis classifiers were developed to identify the most reliable method for reducing the number of tests required for disease detection, thereby lowering diagnostic costs. These classifiers utilize medical history variables, blood analysis and ApoE genotype variables, and neuropsychological assessment variables individually. To model these classifiers, we employed the fine-tuned version of the best-performing algorithm, ensuring optimal predictive performance. This approach aims to streamline the diagnostic process while maintaining diagnostic accuracy.
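One way to realize this design is to train the same algorithm on each variable group alone and compare cross-validated scores; the column names and the grouping below are hypothetical stand-ins for the study's variable sets, and an untuned RF is used for brevity in place of the fine-tuned best performer:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Hypothetical feature table and placeholder CN/MCI/AD labels.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(300, 6)),
                      columns=["hist_1", "hist_2", "blood_1", "apoe_1", "np_1", "np_2"])
    y = rng.integers(0, 3, size=300)

    # Assumed grouping of variables into three classifier inputs.
    groups = {
        "medical_history": ["hist_1", "hist_2"],
        "blood_and_apoe": ["blood_1", "apoe_1"],
        "neuropsychological": ["np_1", "np_2"],
    }

    for name, cols in groups.items():
        scores = cross_val_score(RandomForestClassifier(random_state=0), df[cols], y, cv=5)
        print(f"{name}: mean CV accuracy = {scores.mean():.3f}")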
4.5. Model evaluation

In this phase, a comprehensive evaluation of the models was conducted to guide future actions. Predictions from all models were compared against actual values using the “classification_report” function.24 The evaluation included accuracy, as well as weighted-average precision, recall, and F1-score, offering a detailed overview of overall model performance. This approach accounts for class imbalances, ensuring robustness across all classes.25 In addition, macro-average and class-wise performance metrics were emphasized when further insights were required. Given the research focus of this study, the deployment phase was omitted. A detailed analysis of the models is presented in the Results and Discussion section.
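A sketch of the evaluation call, assuming scikit-learn; the class names passed to target_names are placeholders for the study's diagnostic categories:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Hypothetical data and model standing in for the trained classifiers.
    X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # The report lists per-class precision/recall/F1 plus accuracy and the
    # macro- and weighted-average rows discussed above.
    print(classification_report(y_test, model.predict(X_test),
                                target_names=["CN", "MCI", "AD"]))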
5. Results

5.1. Feature importance

The Chi-square test results revealed a significant association between the target variable and two key features, CDGLOBAL and MMSCORE, as indicated by their strong chi-square statistics and extremely low p-values. This finding was further confirmed by the permutation feature importance test. Figure 4 shows the feature importance ranking, highlighting CDGLOBAL as the most influential feature, followed by LDELTOTAL, MMSCORE, and