
Artificial Intelligence in Health                                Optimized clustering in medical app detection



number of clusters and identifying clusters of various shapes and sizes. However, the data used in this work form spherical clusters, and hence no improvement was seen from using alternative clustering algorithms. To address the challenges posed by outliers, the non-deterministic nature of K-means, and the potential issue of points near cluster boundaries being assigned to different clusters, several strategies are employed. Before clustering, DBSCAN, used here as an outlier detection technique, is applied to identify and remove outliers from the dataset. This helps prevent outliers from unduly influencing the clustering results and improves the robustness of the algorithm. Furthermore, as already discussed, instead of relying on a single random initialization for the centroids, the algorithm can be run multiple times with different initializations. By averaging the results or selecting the best clustering solution based on a predefined criterion, the impact of the initial centroid choice can be mitigated. In addition, post-clustering techniques such as cluster merging, splitting, or reassignment based on proximity or density can be applied to refine the clustering solution and correct any misassignments near cluster boundaries, which can improve the overall quality of the clustering results. However, these techniques were not used in our experiments, as we did not encounter such misassignments.

Effective management of hyperparameter tuning for ANNs used in conjunction with K-means clustering was crucial in this study. We utilized cross-validation techniques, such as k-fold cross-validation, to assess the performance of different hyperparameter configurations on multiple subsets of the data. This approach ensures that the hyperparameters selected for models trained on the app attributes (app name, description, category [e.g., medical, fitness, and wellness], developer information, etc.) generalize well to unseen data, and it mitigates the risk of overfitting.

This section also introduces the performance metrics employed to assess the effectiveness of the proposed method. The chosen performance metrics provide a quantitative measure of the system's efficiency, allowing for a comprehensive evaluation of its detection capabilities.

4.1. Theoretical formulation

The proposed methodology uses the ANN to classify medical apps and the K-means clustering algorithm to partition the dataset into clusters, with the neural network aiding in determining the centroids of the clusters. The performance of the method is evaluated using a set of performance metrics to assess its detection capabilities.

Let $X = \{x_1, x_2, \ldots, x_n\}$ denote the dataset consisting of $n$ data points, where each $x_i$ represents a feature vector describing a medical application.

The proposed methodology integrates two main components: the ANN and the K-means clustering algorithm.

The ANN is represented as a function $f_{\mathrm{ANN}} : \mathbb{R}^{m} \rightarrow \mathbb{R}^{k}$, where $m$ is the dimensionality of the feature space and $k$ is the number of classes. The ANN learns complex patterns and relationships within dataset $X$ to classify medical apps into different categories.

The K-means clustering algorithm partitions the dataset $X$ into $K$ clusters, $C = \{C_1, \ldots, C_K\}$, where each cluster represents data points that share similarities.

Let $\mu = \{\mu_1, \mu_2, \ldots, \mu_K\}$ denote the centroids of the clusters. The objective of K-means clustering is to minimize the within-cluster sum of squared distances, which can be formulated as Equation I:

$$\underset{C,\,\mu}{\text{Minimise}} \; \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2} \qquad \text{(I)}$$

The neural network aids in fixing the centroids $\mu$ of each cluster within the K-means clustering process, contributing to superior detection performance.

To assess the effectiveness of the proposed method, various performance metrics were employed, including but not limited to accuracy, precision, recall, F1 score, specificity, and area under the receiver operating characteristic (ROC) curve. These metrics provide a quantitative measure of the system's efficiency and enable a comprehensive evaluation of its detection capabilities.

The theoretical formulation presented here encompasses several key aspects related to the challenges and approaches to app detection using machine-learning techniques:

(i) Parametric versus non-parametric classifiers: Parametric classifiers rely on model building and are therefore slow, which poses challenges for real-time app detection. Non-parametric classifiers estimate the app distribution from a set of training data and have the drawback of requiring substantial training data for effective model building.

(ii) Accuracy paradox: Machine learning techniques typically prioritize the detection of known classes, leading to satisfactory prediction accuracy for established apps. The tendency to concentrate on larger classes with known apps can result in poor prediction accuracy for novel or zero-day apps.

(iii) Zero-day application clustering: Even in cases such as clustering, the focus would be placed on labeling
            Volume 1 Issue 4 (2024)                         23                               doi: 10.36922/aih.2585
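As a minimal sketch of the K-means objective in Equation I and of the strategy of running the algorithm several times with different initializations, the following standard-library Python code repeats K-means and keeps the run with the lowest within-cluster sum of squared distances. The function names, the toy data, the number of restarts, and the choice of lowest objective value as the "predefined criterion" are illustrative assumptions and are not taken from the authors' implementation. The DBSCAN outlier-removal step and the ANN-assisted centroid selection are not reproduced here.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances (the objective in Equation I)."""
    return sum(squared_distance(p, centroids[c]) for p, c in zip(points, labels))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run starting from k randomly chosen data points."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so this run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return centroids, labels


def kmeans_best_of(points, k, n_init=10, seed=0):
    """Repeat K-means n_init times and keep the run with the lowest WCSS."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        centroids, labels = kmeans_once(points, k, rng)
        score = wcss(points, centroids, labels)
        if best is None or score < best[0]:
            best = (score, centroids, labels)
    return best


# Hypothetical 2-D feature vectors forming two well-separated groups.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
best_wcss, best_centroids, best_labels = kmeans_best_of(data, k=2)
print(round(best_wcss, 4), best_labels)
```

Selecting the restart with the smallest objective value is only one possible criterion. The text also mentions averaging results across runs, which would need a way to match cluster labels between runs and is left out of this sketch.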